# V - Feature Extraction of test images

- Now that we have trained and evaluated our models on a dataset with known labels, we
must predict the bug type for data with unknown labels.

- To apply our supervised method to the given test data, we had to repeat all the steps
described previously.

- We therefore start by generating the features from the test images.

## A - Loading images and computing features

#### Setting up

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from tools import *
import math
pd.set_option("display.max_colwidth",None)
pd.set_option("display.max_rows",None)
pd.set_option("display.max_columns",None)
pd.options.display.precision = 5
plt.rcParams.update(plt.rcParamsDefault)
sns.set_style('white')
import cv2

In [2]:
import datetime
from time import process_time

# Extra library imports
from PIL import Image
from scipy.optimize import minimize
import scipy.ndimage as ndi
from scipy.stats import skew, kurtosis

In [13]:
path = 'test/'
path2 = 'data/'
data = pd.read_csv(path2 + "test_manual_labels.csv", sep=";",index_col=0,header=None,names=['bug_type'])
print(data.shape)
data.head(10)

(97, 1)


|     | bug_type |
|-----|----------|
| 251 | Bee      |
| 252 | Bee      |
| 253 | Bee      |
| 254 | Bee      |
| 255 | Bee      |
| 256 | Bee      |
| 257 | Bee      |
| 258 | Bee      |
| 259 | Bee      |
| 260 | Bee      |


In [14]:
import os
from PIL import Image

# Load the input image
#os.path.join(os.getcwd() + path) 
image_files = [file for file in os.listdir(path) if file.endswith(('.JPG'))]
print('Found {0} images'.format(len(image_files)))



im_dict = {}

for file in image_files:
    image_path = os.path.join(path, file)
    image = Image.open(image_path)
    im_arr = np.array(image)
    sx, sy, nb_channels = im_arr.shape
    print('Loaded image {0}'.format(file))
    print('Input image of dimension {0} x {1} x {2} pixels'.format(sx, sy,
                                                                nb_channels))
    print('Input image of type {0}'.format(type(im_arr[0, 0, 0])))
    print('Minimal intensity value: {0}'.format(np.min(im_arr)))
    print('Maximal intensity value: {0}'.format(np.max(im_arr)))
    file_name = int(file[:-4])
    im_dict[file_name] = im_arr


Found 97 images
Loaded image 251.JPG
Input image of dimension 4000 x 6000 x 3 pixels
Input image of type <class 'numpy.uint8'>
Minimal intensity value: 0
Maximal intensity value: 255
Loaded image 252.JPG
Input image of dimension 4000 x 6000 x 3 pixels
Input image of type <class 'numpy.uint8'>
Minimal intensity value: 8
Maximal intensity value: 255
Loaded image 253.JPG
Input image of dimension 4000 x 6000 x 3 pixels
Input image of type <class 'numpy.uint8'>
Minimal intensity value: 0
Maximal intensity value: 255
Loaded image 254.JPG
Input image of dimension 4000 x 6000 x 3 pixels
Input image of type <class 'numpy.uint8'>
Minimal intensity value: 0
Maximal intensity value: 255
Loaded image 255.JPG
Input image of dimension 4000 x 6000 x 3 pixels
Input image of type <class 'numpy.uint8'>
Minimal intensity value: 0
Maximal intensity value: 255
Loaded image 256.JPG
Input image of dimension 4000 x 6000 x 3 pixels
Input image of type <class 'numpy.uint8'>
Minimal intensity value: 0
Maximal int

In [15]:
mask_path = path + 'masks/'
mask_dict = {}
mask_files = [file for file in os.listdir(mask_path) if file.endswith(('.tif'))]

for file in mask_files:
    mask_path2 = os.path.join(mask_path, file)
    mask = Image.open(mask_path2).convert('L')
    mask_arr = np.array(mask)
    #mask.show()
    #print(np.unique(mask_arr.flatten()))

    #Some values being a little over 0 or a little below 255, we need to threshold the mask
    mask_arr[mask_arr < 122] = 0
    mask_arr[mask_arr >= 122] = 1
    
    sx, sy = mask_arr.shape
    print('Loaded mask {0}'.format(file))
    print('Mask of dimension {0} x {1} pixels'.format(sx, sy))

    file_name = file[7:-4]
    print("File name : ", file_name)
    num = int(file_name)
    mask_dict[num] = mask_arr

Loaded mask binary_251.tif
Mask of dimension 4000 x 6000 pixels
File name :  251
Loaded mask binary_252.tif
Mask of dimension 4000 x 6000 pixels
File name :  252
Loaded mask binary_253.tif
Mask of dimension 4000 x 6000 pixels
File name :  253
Loaded mask binary_254.tif
Mask of dimension 4000 x 6000 pixels
File name :  254


Loaded mask binary_255.tif
Mask of dimension 4000 x 6000 pixels
File name :  255
Loaded mask binary_256.tif
Mask of dimension 4000 x 6000 pixels
File name :  256
Loaded mask binary_257.tif
Mask of dimension 4000 x 6000 pixels
File name :  257
Loaded mask binary_258.tif
Mask of dimension 4000 x 6000 pixels
File name :  258
Loaded mask binary_259.tif
Mask of dimension 4000 x 6000 pixels
File name :  259
Loaded mask binary_260.tif
Mask of dimension 4000 x 6000 pixels
File name :  260
Loaded mask binary_261.tif
Mask of dimension 4000 x 6000 pixels
File name :  261
Loaded mask binary_262.tif
Mask of dimension 4000 x 6000 pixels
File name :  262
Loaded mask binary_263.tif
Mask of dimension 4000 x 6000 pixels
File name :  263
Loaded mask binary_264.tif
Mask of dimension 4000 x 6000 pixels
File name :  264
Loaded mask binary_265.tif
Mask of dimension 4000 x 6000 pixels
File name :  265
Loaded mask binary_266.tif
Mask of dimension 4000 x 6000 pixels
File name :  266
Loaded mask binary_267.tif
M

### Functions to calculate features / important elements

In [29]:
def convert_to_grayscale(im_arr):
    # Calculate the grayscale values using the standard luminosity method
    gray_im_arr = 0.2989 * im_arr[:, :, 0] + 0.5870 * im_arr[:, :, 1] + 0.1140 * im_arr[:, :, 2]
    return gray_im_arr
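
As a quick sanity check of the weights used in `convert_to_grayscale`, the sketch below applies the same formula to a hand-made 1 x 3 image (pure red, pure green, white). The toy image is an illustrative assumption and is not part of the dataset. It also shows that the result is a float array rather than `uint8`.

```python
import numpy as np

def convert_to_grayscale(im_arr):
    # Same luminosity weights as in the cell above
    return 0.2989 * im_arr[:, :, 0] + 0.5870 * im_arr[:, :, 1] + 0.1140 * im_arr[:, :, 2]

# Hypothetical 1 x 3 RGB image: pure red, pure green, white
toy = np.array([[[255, 0, 0], [0, 255, 0], [255, 255, 255]]], dtype=np.uint8)
gray = convert_to_grayscale(toy)

print(gray.shape, gray.dtype)
print(gray)
```

Because the three weights sum to 0.9999, a white pixel maps to slightly less than 255.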

In [30]:
def find_edges(mask):
    ### THIS IS THE FUNCTION USED, THE OTHER find_edges FUNCTIONS ARE NOT USED
    ### IDENTIFY THE BOUNDARY PIXELS (EDGES) OF THE INSECT
    start = process_time()
    # Shift the mask in all directions and compare with the original to detect boundaries
    shifted_up = np.roll(mask, -1, axis=0)
    shifted_down = np.roll(mask, 1, axis=0)
    shifted_left = np.roll(mask, -1, axis=1)
    shifted_right = np.roll(mask, 1, axis=1)
    
    # Handle boundary conditions by assuming outside of the mask is all zeros
    shifted_up[-1, :] = 0
    shifted_down[0, :] = 0
    shifted_left[:, -1] = 0
    shifted_right[:, 0] = 0
    
    # Identify edges by looking for discrepancies between the mask and its shifted versions
    edges = (mask != shifted_up) | (mask != shifted_down) | \
            (mask != shifted_left) | (mask != shifted_right)

    # Optionally, convert boolean mask to integer if needed
    edges = edges.astype(int)
    end = process_time()
    print("Time taken for finding edges:", end - start)
    return edges

def find_edges2(mask):
    ### DOESN'T WORK (NOT USED)
    edges = np.zeros_like(mask)
    # Check for transitions from 1 to 0 (insect to background)
    for i in range(mask.shape[0] - 1):
        for j in range(mask.shape[1] - 1):
            if mask[i, j] == 1 and (mask[i+1, j] == 0 or mask[i, j+1] == 0):
                edges[i, j] = 1
    return edges

def find_edges3(mask_arr):
    ### DOESN'T WORK (NOT USED)
    # This function will extract the boundary of the insect based on the mask
    edges = np.zeros_like(mask_arr)
    # Loop over each pixel and check for edges
    for i in range(1, mask_arr.shape[0]-1):
        for j in range(1, mask_arr.shape[1]-1):
            if mask_arr[i, j] == 1 and np.any(mask_arr[i-1:i+2, j-1:j+2] == 0):
                edges[i, j] = 1
    return edges

from skimage.segmentation import find_boundaries
def find_edges_skimage(mask):
    ### FOR TESTING PURPOSES (NOT USED)
    edges = find_boundaries(mask, mode='outer').astype(np.uint8)
    return edges 

def find_edges_opencv(mask):
    ### FOR TESTING PURPOSES (NOT USED)
    # Ensure the mask is of the correct type (uint8), and binarize just in case
    mask = np.uint8(mask * 255)
    # Find contours using OpenCV
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Create an empty image to draw contours
    edges = np.zeros_like(mask)
    cv2.drawContours(edges, contours, -1, 255, 1)  # Draw contours in white, 1 pixel thick
    return edges
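
To see what the shift-and-compare logic of `find_edges` marks as an edge, the sketch below runs the same comparison on a small hand-made mask. The helper name `find_edges_sketch` and the toy mask are assumptions for illustration. The comparison is symmetric, so it flags background pixels that touch the object as well as object pixels that touch the background. Multiplying by the mask keeps only the object side.

```python
import numpy as np

def find_edges_sketch(mask):
    """Shift-and-compare edge detection, mirroring find_edges without the timing code."""
    up = np.roll(mask, -1, axis=0)
    down = np.roll(mask, 1, axis=0)
    left = np.roll(mask, -1, axis=1)
    right = np.roll(mask, 1, axis=1)
    # Treat everything outside the image as background
    up[-1, :] = 0
    down[0, :] = 0
    left[:, -1] = 0
    right[:, 0] = 0
    edges = (mask != up) | (mask != down) | (mask != left) | (mask != right)
    return edges.astype(int)

# Hypothetical 5 x 5 mask containing a 3 x 3 square of ones
toy_mask = np.zeros((5, 5), dtype=np.uint8)
toy_mask[1:4, 1:4] = 1

edges = find_edges_sketch(toy_mask)
inner_edges = edges * toy_mask  # keep only edge pixels that belong to the object

print(edges.sum(), inner_edges.sum())
```

On this toy mask the two counts are 20 and 8. A perimeter obtained by summing `edges` therefore counts both sides of the outline, which matters for any feature derived from it.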


In [31]:
def plot_edges(im_arr,edge_points,title):
    fig, ax = plt.subplots()
    ax.imshow(im_arr, cmap='gray')
    edges_y, edges_x = zip(*edge_points)
    ax.scatter(edges_x, edges_y, color='red', s=0.2)
    ax.set_title(title)
    plt.show()

## 1 - Required features given by instructions

In [32]:
def get_symmetry_index(gray_im_arr):
    ### COMPUTE THE SYMMETRY INDEX OF THE IMAGE
    ### SYMMETRY INDEX : THE MEAN ABSOLUTE DIFFERENCE BETWEEN THE LEFT AND RIGHT HALVES : POSSIBLE RANGE : 0 TO 255
    ### (COMPUTED FOR THE BOUNDING BOX OF THE BUG AND FOR THE WHOLE IMAGE SEPARATELY)
    ### RETURNS THE SYMMETRY INDEX
    
    start = process_time()
    print("Starting computation of symmetry index")
    # Split the grayscale image into left and right halves
    height, width = gray_im_arr.shape
    center = width // 2
    left_half = gray_im_arr[:, :center]
    right_half = gray_im_arr[:, center:]
    
    # Make sure both halves have the same width
    if right_half.shape[1] != left_half.shape[1]:
        # If the image width is odd, the right half will have one more column
        right_half = right_half[:, :-1]

    # Mirror the right half horizontally to compare with the left
    mirrored_right = np.fliplr(right_half)

    # Compute the difference between the mirrored right half and the left half
    differences = np.abs(mirrored_right - left_half)

    # Calculate the symmetry index as the mean of the differences
    symmetry_index = np.mean(differences)
    end = process_time()
    print("Computation took ", end - start)
    return symmetry_index
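
The behaviour of the symmetry index at its two extremes can be checked on toy arrays. The sketch below is a minimal re-implementation (the name `symmetry_index_sketch` and the test arrays are assumptions). It adds one precaution that is not in the function above: the input is cast to float, because subtracting `uint8` arrays directly would wrap around instead of producing negative values.

```python
import numpy as np

def symmetry_index_sketch(gray):
    # Cast to float so that uint8 inputs cannot wrap around on subtraction
    gray = np.asarray(gray, dtype=float)
    center = gray.shape[1] // 2
    left = gray[:, :center]
    right = gray[:, center:]
    if right.shape[1] != left.shape[1]:
        right = right[:, :-1]  # same trimming rule as get_symmetry_index
    # Mean absolute difference between the left half and the mirrored right half
    return np.mean(np.abs(np.fliplr(right) - left))

symmetric = np.array([[10, 50, 50, 10],
                      [0, 200, 200, 0]], dtype=np.uint8)
one_sided = np.array([[0, 0, 255, 255],
                      [0, 0, 255, 255]], dtype=np.uint8)

print(symmetry_index_sketch(symmetric))
print(symmetry_index_sketch(one_sided))
```

A perfectly mirrored image gives 0 and a black/white split gives 255, which matches the range stated in the function's header comment.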

In [33]:
def calculate_color_aggregates(row,channel,channel_arr):
    ### COMPUTE THE MEAN, MEDIAN, STANDARD DEVIATION, MINIMUM, MAXIMUM, 1ST QUARTILE AND 3RD QUARTILE OF THE COLOR CHANNEL
    ### ORIGINALLY : ONLY THE MEAN, MIN AND MAX WERE REQUIRED
    ### WE DECIDED TO ADD THE REST FOR MORE INFORMATION 
    ### (COMPUTED FOR BUG AND NON BUG POINTS SEPARATELY)
    row[channel + "_mean"] = np.mean(channel_arr)
    row[channel + "_median"] = np.median(channel_arr)
    row[channel + "_std"] = np.std(channel_arr)
    row[channel + "_min"] = np.min(channel_arr)
    row[channel + "_max"] = np.max(channel_arr)
    row[channel + "_q1"] = np.quantile(channel_arr, .25)
    row[channel + "_q3"] = np.quantile(channel_arr, .75)

In [34]:
from sklearn.decomposition import PCA

def find_and_plot_orthogonal_lines(mask, plot=False):
    ### APPROXIMATION OF THE RATIO OF THE LONGEST ORTHOGONAL LINES THAT CAN BE FITTED INSIDE THE MASK USING PCA
    ### COULD NOT FIND A BETTER WAY TO DO THIS

    points = np.column_stack(np.where(mask > 0))  # Get indices of non-zero elements, which are points of the insect

    if len(points) < 5:
        print("Not enough points to compute the principal axes")
        return 0

    # Fit PCA on the provided points
    pca = PCA(n_components=2)
    pca.fit(points)

    # Calculate the center and directions of the principal components
    center = pca.mean_
    direction1 = pca.components_[0]
    direction2 = pca.components_[1]

    # Extend lines to the edge of the mask by finding intersections
    def find_edge(center, direction, mask):
        t_max = 1000  # Maximum line length to check
        for t in np.linspace(0, t_max, num=int(t_max)):
            point = center + direction * t
            if not (0 <= int(point[0]) < mask.shape[0] and 0 <= int(point[1]) < mask.shape[1]) or mask[int(point[0]), int(point[1])] == 0:
                return center + direction * (t - 1)  # Step back one unit
        return center + direction * t_max

    line1_point1 = find_edge(center, -direction1, mask)
    line1_point2 = find_edge(center, direction1, mask)
    line2_point1 = find_edge(center, -direction2, mask)
    line2_point2 = find_edge(center, direction2, mask)

    # Calculate the ratio of the smaller dimension to the larger dimension
    length1 = np.linalg.norm(line1_point2 - line1_point1)
    length2 = np.linalg.norm(line2_point2 - line2_point1)
    ratio = min(length1, length2) / max(length1, length2) if max(length1, length2) != 0 else 0

    if plot:
        # Plotting
        fig, ax = plt.subplots()
        ax.imshow(mask, cmap='gray', interpolation='nearest')
        ax.plot([line1_point1[1], line1_point2[1]], [line1_point1[0], line1_point2[0]], 'r-')  # Red line
        ax.plot([line2_point1[1], line2_point2[1]], [line2_point1[0], line2_point2[0]], 'b-')  # Blue line
        plt.axis('off')
        plt.show()

    return ratio
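
A possible cross-check for the PCA-based ratio is to project all mask pixels onto the principal axes and compare the two extents. This is a different measurement from the function above, which follows a line from the centroid until it leaves the mask. The sketch below uses only NumPy (an eigen-decomposition of the covariance matrix instead of `sklearn.decomposition.PCA`), and the name `axis_extent_ratio` and the rectangular toy mask are assumptions.

```python
import numpy as np

def axis_extent_ratio(mask):
    """Ratio of the short to the long extent of the mask along its principal axes."""
    points = np.column_stack(np.where(mask > 0)).astype(float)
    if len(points) < 2:
        return 0.0
    centered = points - points.mean(axis=0)
    # Eigenvectors of the 2 x 2 covariance matrix are the principal axes
    _, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    projected = centered @ eigvecs  # pixel coordinates in the PCA frame
    extents = projected.max(axis=0) - projected.min(axis=0)
    longest = extents.max()
    return float(extents.min() / longest) if longest > 0 else 0.0

# Hypothetical mask: a 20 x 60 axis-aligned rectangle
toy_mask = np.zeros((100, 100), dtype=np.uint8)
toy_mask[40:60, 20:80] = 1

print(axis_extent_ratio(toy_mask))
```

For the rectangle, the extents are 19 and 59 pixel steps, so the ratio is close to 19/59. Unlike the line-following approach, this variant has no maximum line length, but it is sensitive to thin appendages such as legs or antennae.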


### Functions for additional features (our own)

In [35]:
def get_ratio(mask_area,im_1d_arr):
    ### COMPUTE THE RATIO BETWEEN THE AREA OF THE INSECT AND THE TOTAL AREA OF THE IMAGE (nb_pixels_ratio)
    ratio = mask_area / np.size(im_1d_arr)
    #print('The bug fills {0}% of the image'.format(ratio * 100))
    return ratio

In [36]:
def calculate_bounding_rectangle_features(row, mask_arr, mask_area):
    ### COMPUTE THE ASPECT RATIO AND RECTANGULARITY OF THE INSECT BY FINDING THE BOUNDING BOX
    ### ASPECT RATIO = MIN(HEIGHT, WIDTH) / MAX(HEIGHT, WIDTH)
    ### RECTANGULARITY = AREA OF THE INSECT / (HEIGHT * WIDTH)
    ### RETURN THE TOP, BOTTOM, LEFT AND RIGHT BOUNDARIES OF THE BOUNDING BOX

    start = process_time()
    print("Starting computation of bounding rectangle features")
    # Ensure that mask_arr is a binary mask (0s and 1s)
    if mask_arr.dtype != bool:
        mask_arr = mask_arr > 0

    # Find rows and columns where the mask has at least one pixel
    rows_with_insect = np.any(mask_arr, axis=1)
    cols_with_insect = np.any(mask_arr, axis=0)

    # Find the indices of the rows and columns that contain the insect
    rows_indices = np.where(rows_with_insect)[0]
    cols_indices = np.where(cols_with_insect)[0]

    if len(rows_indices) == 0 or len(cols_indices) == 0:
        # No insect found in the mask, return an aspect ratio of 0
        return 0

    # Calculate the max extents in both directions
    vertical_extent = rows_indices[-1] - rows_indices[0] + 1
    horizontal_extent = cols_indices[-1] - cols_indices[0] + 1

    # Calculate the aspect ratio (smaller dimension divided by the larger dimension)
    aspect_ratio = min(vertical_extent, horizontal_extent) / max(vertical_extent, horizontal_extent)

    end = process_time()
    print("Computation took ", end - start)
    row['aspect_ratio'] = aspect_ratio
    row['rectangularity'] = mask_area / (vertical_extent * horizontal_extent)

    # Extract the mask values within the bounding box dimensions
    return rows_indices[0],rows_indices[-1], cols_indices[0],cols_indices[-1]
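
The two bounding-box features are easy to verify by hand on a small shape. The sketch below is a compact variant that returns the values instead of writing them into `row`. The function name, the return layout and the L-shaped toy mask are assumptions made for illustration.

```python
import numpy as np

def bounding_box_features(mask):
    """Return (aspect_ratio, rectangularity, (top, bottom, left, right))."""
    mask = np.asarray(mask) > 0
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if rows.size == 0:
        return 0.0, 0.0, None  # empty mask: no bounding box
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    aspect_ratio = min(height, width) / max(height, width)
    rectangularity = mask.sum() / (height * width)
    box = (int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1]))
    return float(aspect_ratio), float(rectangularity), box

# Hypothetical L-shaped object: a 4 x 8 bounding box with one quadrant missing
toy = np.zeros((10, 10), dtype=np.uint8)
toy[2:6, 1:9] = 1
toy[2:4, 5:9] = 0

aspect_ratio, rectangularity, box = bounding_box_features(toy)
print(aspect_ratio, rectangularity, box)
```

The object covers 24 of the 32 pixels of its 4 x 8 box, giving an aspect ratio of 0.5 and a rectangularity of 0.75.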

In [37]:
def get_roundness(area, edges):
    ### COMPUTE THE ROUNDNESS USING AREA AND PERIMETER : INDICATOR OF HOW ROUND THE MASK OF THE BUG IS
    ### ROUNDNESS = 4 * PI * AREA / PERIMETER^2 (PERIMETER : SUM OF EDGE PIXELS ; AREA : SUM OF MASK PIXELS)
    ### RETURN THE ROUNDNESS AND PERIMETER OF THE MASK
    
    perimeter = np.sum(edges)

    if perimeter == 0:  # Avoid division by zero; keep the (roundness, perimeter) return shape
        return 0, 0
    roundness = 4 * np.pi * area / (perimeter ** 2)
    return roundness, perimeter

def get_roundness2(mask,area):
    ### OLD FUNCTION (NOT USED)
    
    # Calculate the perimeter by detecting edges within the mask
    # Shift the mask in all directions and compare with the original to detect boundaries
    shifted_up = np.roll(mask, -1, axis=0)
    shifted_down = np.roll(mask, 1, axis=0)
    shifted_left = np.roll(mask, -1, axis=1)
    shifted_right = np.roll(mask, 1, axis=1)
    
    # Handle boundary conditions by assuming outside of the mask is all zeros
    shifted_up[-1, :] = 0
    shifted_down[0, :] = 0
    shifted_left[:, -1] = 0
    shifted_right[:, 0] = 0
    
    # Calculate perimeter by summing up the boundary detections
    perimeter = np.sum(mask != shifted_up) + np.sum(mask != shifted_down) + \
                np.sum(mask != shifted_left) + np.sum(mask != shifted_right)
    print("Calculated perimeter is ", perimeter)
    
    # Calculate roundness using the area and perimeter
    if perimeter == 0:  # To avoid division by zero error
        return 0, 0
    roundness = 4 * np.pi * area / (perimeter ** 2)

    return roundness, perimeter
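
For reference, the roundness formula can be evaluated on ideal shapes, where area and perimeter are known exactly. The sketch below is only a worked example of the formula with assumed dimensions; it does not use pixel masks.

```python
import math

def roundness(area, perimeter):
    # 4 * pi * A / P^2, defined as 0 for a degenerate shape
    return 4 * math.pi * area / perimeter ** 2 if perimeter > 0 else 0.0

r = 3.0  # radius of an ideal circle
circle = roundness(math.pi * r ** 2, 2 * math.pi * r)

s = 5.0  # side of an ideal square
square = roundness(s ** 2, 4 * s)

print(circle, square)
```

An ideal circle gives 1 and a square gives pi / 4 (about 0.785). On masks, the perimeter is a count of edge pixels, so the values obtained in practice will deviate from these ideals and are mainly useful for comparing insects with each other.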

In [38]:
def rgb_to_hsv_opencv(rgb_image):
    ### TOOL TO CONVERT RED GREEN BLUE TO HUE SATURATION VALUE WITH OPENCV
    start = process_time()
    print("Starting conversion from RGB to HSV using OPENCV")
    hsv_image = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV)
    if hsv_image.shape[0] != rgb_image.shape[0]:
        # Defensive check: cvtColor preserves the input shape, so this branch is not expected to run
        hsv_image = np.transpose(hsv_image, (1, 0, 2))
    end = process_time()
    print("The conversion took ", end-start)
    return hsv_image

def rgb_to_hsv(rgb_image):
    ### (UNUSED) TOOL TO CONVERT RED GREEN BLUE TO HUE SATURATION VALUE MANUALLY 
    ### HUE = 60 * (0 + (G - B) / delta) % 360 (delta : c_max - c_min ; c_max : max of R,G,B ; c_min : min of R,G,B)
    ### SATURATION = delta / c_max 
    ### VALUE = c_max
    start = process_time()
    print("Starting conversion from RGB to HSV ")
    rgb_image = rgb_image.astype('float') / 255  # Normalize RGB values to [0, 1]

    # Prepare arrays for the HSV channels
    hsv_image = np.zeros_like(rgb_image)

    r, g, b = rgb_image[:, :, 0], rgb_image[:, :, 1], rgb_image[:, :, 2]
    c_max = np.max(rgb_image, axis=2)
    c_min = np.min(rgb_image, axis=2)
    delta = c_max - c_min

    # Initialize hue
    hue = np.zeros_like(c_max)

    # Calculate the hue component
    mask = delta != 0  # Mask where delta is not zero
    # Red is max
    hue[mask & (c_max == r)] = 60 * (0 + (g - b)[mask & (c_max == r)] / delta[mask & (c_max == r)]) % 360
    # Green is max
    hue[mask & (c_max == g)] = 60 * (2 + (b - r)[mask & (c_max == g)] / delta[mask & (c_max == g)]) % 360
    # Blue is max
    hue[mask & (c_max == b)] = 60 * (4 + (r - g)[mask & (c_max == b)] / delta[mask & (c_max == b)]) % 360

    # Calculate the saturation component
    saturation = np.zeros_like(c_max)
    saturation[c_max != 0] = delta[c_max != 0] / c_max[c_max != 0]

    # Value component is just the max of R, G, B
    value = c_max

    # Combine into a single HSV image
    hsv_image[:, :, 0] = hue
    hsv_image[:, :, 1] = saturation
    hsv_image[:, :, 2] = value

    end = process_time()
    print("The conversion took ", end-start)
    return hsv_image
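
The two conversion functions above do not return values on the same scale. The manual version yields hue in degrees with saturation and value in [0, 1], whereas OpenCV, for `uint8` input, stores hue halved (0 to 179) and saturation and value in 0 to 255. The sketch below illustrates the relation using the standard-library `colorsys` module as an independent reference. The helper names are assumptions and neither function is part of the notebook.

```python
import colorsys

def hsv_degrees(r, g, b):
    """HSV of an 8-bit RGB triple: hue in degrees, S and V in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360, s, v

def to_opencv_8bit_scale(h_deg, s, v):
    """Rescale to the ranges OpenCV uses for uint8 images (H halved, S and V times 255)."""
    return round(h_deg / 2), round(s * 255), round(v * 255)

for name, rgb in [("red", (255, 0, 0)), ("green", (0, 255, 0)), ("blue", (0, 0, 255))]:
    hsv = hsv_degrees(*rgb)
    print(name, hsv, to_opencv_8bit_scale(*hsv))
```

Since the features are computed from the OpenCV output, hue statistics in the feature table should be read on the 0 to 179 scale.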

In [39]:
def calculate_hsv_statistics_for_array(row,channel_name,arr):
    ### COMPUTE THE MEAN, MEDIAN, STANDARD DEVIATION, MINIMUM, MAXIMUM, 1ST QUARTILE AND 3RD QUARTILE OF THE HSV CHANNEL
    ### (COMPUTED FOR BUG AND NON BUG POINTS SEPARATELY)
    row[f'{channel_name}_mean'] = np.mean(arr)
    row[f'{channel_name}_std'] = np.std(arr)
    row[f'{channel_name}_median'] = np.median(arr)
    row[f'{channel_name}_min'] = np.min(arr)
    row[f'{channel_name}_max'] = np.max(arr)
    row[f'{channel_name}_q1'] = np.quantile(arr, .25)
    row[f'{channel_name}_q3'] = np.quantile(arr, .75)

def calculate_hsv_statistics(row, hsv_image, bug_indices, non_bug_indices):
    ### COMPUTE THE HSV STATISTICS 
    ### USED FOR BOTH THE BUG AND NON BUG PIXELS
    start = process_time()
    print("Starting computation of HSV statistics ")

    # Calculate statistics for each channel
    for i, channel_name in enumerate(['hue', 'saturation', 'value']):
        channel_data = hsv_image[:, :, i]
        
        calculate_hsv_statistics_for_array(row,channel_name + '_mask',channel_data[bug_indices])
        calculate_hsv_statistics_for_array(row,channel_name + '_rest',channel_data[non_bug_indices])

    end = process_time()
    print("Computation took ", end - start)

In [40]:
def calculate_entropy(gray_im_arr):
    ### COMPUTE THE ENTROPY OF THE IMAGE TEXTURE : INDICATOR OF RANDOMNESS (COMPLEXITY) IN THE IMAGE TEXTURE 
    ### ENTROPY = -SUM(p * log2(p)) WHERE p IS THE PROBABILITY OF EACH INTENSITY LEVEL (RESULT IN BITS)
    ### (COMPUTED FOR BOTH BUG AND NON BUG PIXELS : MASK AND REST)
    start = process_time()
    print("Starting computation of entropy")

    # Calculate histogram of pixel intensities in the masked region
    histogram, _ = np.histogram(gray_im_arr, bins=256, range=(0, 256))
    histogram = histogram.astype(float)

    # Normalize the histogram to get probability distribution
    histogram /= histogram.sum()

    # Calculate entropy using the entropy formula: -sum(p * log(p)) where p is the probability of each intensity level
    with np.errstate(divide='ignore', invalid='ignore'):
        # Ignore log(0) which is undefined; np.where takes care of zero probabilities
        entropy = -np.sum(np.where(histogram > 0, histogram * np.log2(histogram), 0))
    
    end = process_time()
    print("Computation took ", end - start)
    return entropy
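
The entropy values are easier to interpret with a few reference points. The sketch below is a compact version of the same histogram-based computation (the function name and the three test arrays are assumptions). With 256 bins and a base-2 logarithm, the result lies between 0 and 8 bits.

```python
import numpy as np

def shannon_entropy_bits(values):
    """Shannon entropy (base 2) of 8-bit intensities, via a 256-bin histogram."""
    hist, _ = np.histogram(values, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()  # drop empty bins so the logarithm is always defined
    return float(np.sum(p * np.log2(1 / p)))

constant = np.full((4, 4), 128)          # a single intensity level
two_levels = np.array([0, 0, 255, 255])  # two equally likely levels
uniform = np.arange(256)                 # every level exactly once

print(shannon_entropy_bits(constant),
      shannon_entropy_bits(two_levels),
      shannon_entropy_bits(uniform))
```

A flat region gives 0 bits, two equally frequent levels give 1 bit, and a perfectly uniform histogram reaches the maximum of 8 bits.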

In [41]:
def calculate_geometric_centroid_distance_aggregates(row,mask_arr, bug_argwhere_indices, centroid, boundary_pixels=None, plot=None):
    ### COMPUTE THE DISTANCES OF EACH BOUNDARY PIXEL FROM THE GEOMETRIC CENTROID OF THE INSECT (mean...)
    ### CENTROID DISTANCE = SQRT((Y - Yc)^2 + (X - Xc)^2) WHERE (Yc, Xc) IS THE CENTROID
    ### THEN COMPUTE STATISTICS ON THESE DISTANCES (mean, std, min, max, q1, q3, skewness, kurtosis)  
    
    start = process_time()
    print("Starting computation of geometric centroid distance aggregates")
    
    # Identify boundary pixels using a simple edge detection logic
    if boundary_pixels is None:
        boundary_pixels = []
        for y, x in bug_argwhere_indices:
            # Check if any of the eight surrounding pixels are zero (part of the background)
            if (mask_arr[y-1:y+2, x-1:x+2] == 0).any():
                boundary_pixels.append((y, x))
    if plot is not None:
        print(len(boundary_pixels))
        plot_edges(mask_arr,boundary_pixels,plot)

    # Calculate distances from centroid to each boundary pixel
    boundary_pixels = np.array(boundary_pixels)
    distances = np.linalg.norm(boundary_pixels - centroid, axis=1)
    
    # Compute statistics
    mean_distance = np.mean(distances)
    std_distance = np.std(distances)
    max_distance = np.max(distances)
    q1_distance = np.quantile(distances,0.25)
    q3_distance = np.quantile(distances,0.75)
    min_distance = np.min(distances)
    skewness = skew(distances)
    kurt = kurtosis(distances)
    row['mean_centroid_distance'], row['std_centroid_distance'], row['min_centroid_distance'], row['max_centroid_distance'] = mean_distance, std_distance, min_distance, max_distance
    row['skewness_centroid_distance'], row['kurtosis_centroid_distance'] = skewness, kurt
    row['q1_centroid_distance'], row['q3_centroid_distance'] = q1_distance, q3_distance

    end = process_time()
    print("Computation took ", end - start)

In [42]:
def calculate_fractal_dimension(mask_arr, edges):
    ### UNUSED (DOESN'T WORK)
    start = process_time()
    print("Starting computation of fractal dimension")

    # Calculate the fractal dimension using the box-counting method
    def box_counting(edges):
        sizes = np.arange(1, min(edges.shape)//2, 2)  # range of box sizes
        counts = []
        for size in sizes:
            count = 0
            for i in range(0, edges.shape[0], size):
                for j in range(0, edges.shape[1], size):
                    block = edges[i:i+size, j:j+size]
                    if np.any(block == 1):
                        count += 1
            counts.append(count)
        
        counts = np.array(counts)
        sizes = np.array(sizes)
        # Use only sizes and counts that are non-zero to avoid log(0)
        valid = counts > 0
        coeffs = np.polyfit(np.log(sizes[valid]), np.log(counts[valid]), 1)
        return -coeffs[0]  # Fractal dimension is the negative slope of log-log plot

    fractal_dimension = box_counting(edges)
    end = process_time()
    print("Computation took ", end - start)
    return fractal_dimension

from skimage.measure import label, regionprops
from skimage.transform import resize
def calculate_fractal_dimension_fast(thresholded):
    ### UNUSED (DOESN'T WORK)
    start = process_time()
    print("Starting computation of fractal dimension")

    # List to hold the number of boxes and scales
    scales = np.linspace(0.1, 1.0, num=10)
    counts = []

    # Analyze fractal dimension using different scales
    for scale in scales:
        # Use the shape of the argument rather than the global variable `image`
        scaled = resize(thresholded, (int(scale * thresholded.shape[0]), int(scale * thresholded.shape[1])), order=0, preserve_range=True, anti_aliasing=False)
        labels = label(scaled)
        properties = regionprops(labels)
        # Count regions with area larger than 0
        count = sum([prop.area for prop in properties if prop.area > 0])
        counts.append(count)

    # Calculate the fractal dimension
    # Assuming counts vary with scale as a power law (which is typical for fractal objects)
    coeffs = np.polyfit(np.log(scales), np.log(counts), 1)
    fractal_dim = -coeffs[0]
    end = process_time()
    print("Computation took ", end - start)
    return fractal_dim


In [43]:
def calculate_fourier_descriptors(row,mask_arr, num_descriptors, edges_x, edges_y):
    ### COMPUTE THE FOURIER DESCRIPTORS OF THE INSECT BOUNDARY (EDGES) TO DESCRIBE THE SHAPE OF THE INSECT
    ### FOURIER DESCRIPTOR : A COMPLEX NUMBER REPRESENTING THE AMPLITUDE AND PHASE OF A WAVE
    
    start = process_time()
    print("Starting computation of Fourier descriptors")
    # Combine x and y to form a complex number, which represents the boundary
    boundary_complex = edges_x + 1j * edges_y
    # Compute the Fourier Transform
    fourier_result = np.fft.fft(boundary_complex)
    # Retain only the first 'num_descriptors' Fourier coefficients
    descriptors = fourier_result[:num_descriptors]
    # Normalize the descriptors by the first descriptor to make the description scale-invariant
    descriptors /= np.abs(descriptors[0])
    end = process_time()
    print("Computation took ", end - start)
    for i, descriptor in enumerate(descriptors):
        row[f'fourier_descriptor_real_{i}'] = descriptor.real
        row[f'fourier_descriptor_imag_{i}'] = descriptor.imag
    return descriptors
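
Fourier descriptors are usually computed on boundary points listed in the order in which they are met when walking around the contour. If `edges_x` and `edges_y` come from `np.where`, they are in raster order instead. In addition, the first coefficient only encodes the position of the shape. The sketch below shows a common normalization under stated assumptions: an ordered closed contour, a non-zero first harmonic, and hypothetical names throughout. It drops coefficient 0 and divides by the magnitude of coefficient 1. This is an alternative to the function above, not a description of it.

```python
import numpy as np

def fourier_descriptors_sketch(x, y, num_descriptors):
    """Translation- and scale-invariant descriptors of an ordered closed contour."""
    z = np.asarray(x, dtype=float) + 1j * np.asarray(y, dtype=float)
    spectrum = np.fft.fft(z)
    # spectrum[0] only encodes the centroid, so it is dropped (translation invariance);
    # dividing by |spectrum[1]| removes the overall size (scale invariance).
    return spectrum[1:num_descriptors + 1] / np.abs(spectrum[1])

# Two hypothetical circles with different centres and radii, sampled in contour order
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
small = fourier_descriptors_sketch(np.cos(theta), np.sin(theta), 5)
large = fourier_descriptors_sketch(10 + 5 * np.cos(theta), -3 + 5 * np.sin(theta), 5)

print(np.allclose(small, large))
```

Both circles yield the same descriptors, which is the property wanted when insects are photographed at different positions and scales.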

In [44]:
def calculate_convex_hull_properties(row, mask_arr, hull, insect_area, insect_perimeter):
    ### COMPUTE THE PROPERTIES OF THE CONVEX HULL (SMALLEST CONVEX SHAPE THAT ENCLOSES THE INSECT) BY COMPARING IT TO THE INSECT
    ### HULL TO INSECT AREA RATIO : RATIO OF THE AREA OF THE INSECT TO THE AREA OF THE CONVEX HULL (VALUE BETWEEN 0 AND 1)
    ### HULL CONVEXITY : RATIO OF THE PERIMETER OF THE CONVEX HULL TO THE PERIMETER OF THE INSECT
    hull_area = hull.volume  # In 2D, volume is the area
    hull_perimeter = hull.area  # In 2D, area is the perimeter

    # Area ratio (insect area to convex hull area)
    area_ratio = insect_area / hull_area if hull_area > 0 else 0

    # Convexity (guard the denominator to avoid a division by zero)
    convexity = hull_perimeter / insect_perimeter if insect_perimeter > 0 else 0
    
    row['hull_area'] = hull_area
    row['hull_to_insect_area_ratio'] = area_ratio
    row['hull_convexity'] = convexity
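
The two hull ratios can be checked on a polygon whose hull is known. The function above relies on a hull object exposing `volume` and `area`. The sketch below instead uses a small dependency-free hull (Andrew's monotone chain) and the shoelace formula, purely as an illustration. All names and the L-shaped outline are assumptions.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return np.array(pts, dtype=float)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Pop points that would make a clockwise or straight turn
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return np.array(half(pts) + half(reversed(pts)), dtype=float)

def polygon_area_perimeter(vertices):
    """Shoelace area and perimeter of a closed polygon given as ordered vertices."""
    nxt = np.roll(vertices, -1, axis=0)
    area = 0.5 * abs(np.sum(vertices[:, 0] * nxt[:, 1] - nxt[:, 0] * vertices[:, 1]))
    perimeter = np.sum(np.linalg.norm(nxt - vertices, axis=1))
    return float(area), float(perimeter)

# Hypothetical L-shaped outline (ordered vertices)
outline = np.array([(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)], dtype=float)
shape_area, shape_perimeter = polygon_area_perimeter(outline)
hull_area, hull_perimeter = polygon_area_perimeter(convex_hull(outline))

area_ratio = shape_area / hull_area           # insect area / hull area
convexity = hull_perimeter / shape_perimeter  # hull perimeter / insect perimeter
print(area_ratio, convexity)
```

The L-shape has area 3 and perimeter 8, while its hull has area 3.5 and perimeter 6 + sqrt(2). Both ratios are below 1, and they approach 1 as the shape becomes convex.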


In [45]:
def fit_triangle(points, hull):
    ### (USED BY BELOW FUNCTION get_triangle_similarity FOR TRIANGLE SIMILARITY SCORE) 
    ### COMPUTE THE TRIANGLE THAT FITS THE SHAPE OF THE INSECT 
    ### RETURN THE TRIANGLE AND THE AREA OF THE TRIANGLE

    start = process_time()
    print("Starting computation of triangle fit")

    hull_points = points[hull.vertices]

    # Simplified assumption: Pick three points that are furthest apart in the hull to form a triangle
    max_dist = 0
    triangle = None
    n = len(hull_points)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                tri_points = np.array([hull_points[i], hull_points[j], hull_points[k]])
                # Calculate perimeter of the triangle
                perimeter = np.sum(np.sqrt(np.sum(np.diff(np.vstack([tri_points, tri_points[0]]), axis=0)**2, axis=1)))
                if perimeter > max_dist:
                    max_dist = perimeter
                    triangle = tri_points

    # Calculate area of the triangle using Heron's formula
    if triangle is not None:
        a = np.linalg.norm(triangle[0] - triangle[1])
        b = np.linalg.norm(triangle[1] - triangle[2])
        c = np.linalg.norm(triangle[2] - triangle[0])
        s = 0.5 * (a + b + c)
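        # Clamp at zero: rounding can make the product slightly negative for near-degenerate triangles
        area = math.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))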
        end = process_time()
        print("Computation took ", end - start)
        return triangle, area
    end = process_time()
    print("Computation took ", end - start)
    return None, 0


In [46]:
def get_triangle_similarity(hull, points, insect_area):    
    ### COMPUTE THE SIMILARITY SCORE BETWEEN THE INSECT AND THE TRIANGLE THAT FITS THE SHAPE OF THE INSECT
    ### SIMILARITY = MIN(AREA OF INSECT, AREA OF TRIANGLE) / MAX(AREA OF INSECT, AREA OF TRIANGLE)
    
    # Fit triangle to points
    triangle, triangle_area = fit_triangle(points, hull)
    
    # Similarity score based on area comparison
    if triangle_area == 0:  # Avoid division by zero
        return 0
    similarity = min(insect_area, triangle_area) / max(insect_area, triangle_area)
    return similarity
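
The area step and the similarity score can be reproduced on a 3-4-5 right triangle. The sketch below compares Heron's formula with the equivalent cross-product formula, which needs no square root. The function names, the triangle and the insect area are assumptions for illustration.

```python
import numpy as np

def triangle_area(p0, p1, p2):
    """Area from the 2-D cross product of two edge vectors."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    return 0.5 * abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))

def heron_area(p0, p1, p2):
    """Area from the three side lengths (Heron's formula)."""
    a, b, c = (np.linalg.norm(np.subtract(u, v)) for u, v in ((p0, p1), (p1, p2), (p2, p0)))
    s = 0.5 * (a + b + c)
    # Clamp at zero: rounding can make the product slightly negative for flat triangles
    return float(np.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0)))

tri = [(0, 0), (4, 0), (0, 3)]
insect_area = 4.0  # hypothetical mask area, in the same units as the triangle

area = triangle_area(*tri)
similarity = min(insect_area, area) / max(insect_area, area)
print(area, heron_area(*tri), similarity)
```

Both formulas give an area of 6, so the similarity here is 4 / 6. The score is 1 only when the insect area equals the triangle area, and it decreases symmetrically when either one is larger.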

In [47]:
def calculate_ellipse_features(row, mask_arr, points):
    ### COMPUTE THE FEATURES OF THE ELLIPSE FIT TO THE INSECT USING THE POINTS OF THE INSECT
    ### ELLIPSE ANGLE = ANGLE OF THE ELLIPSE
    ### ELLIPSE AXIS RATIO = MINOR AXIS LENGTH / MAJOR AXIS LENGTH
    ### ELLIPSE ECCENTRICITY FORMULA = SQRT(1 - (b^2 / a^2)) WHERE a = MAJOR AXIS LENGTH AND b = MINOR AXIS LENGTH

    start = process_time()
    print("Starting computation of ellipse features")
    
    # Check if there are enough points to fit an ellipse
    if len(points) < 5:
        print("Not enough points to fit an ellipse")
        return
    
    # Fit an ellipse to the points
    ellipse = cv2.fitEllipse(points)
    
    # Extract features from the ellipse
    (center, axes, angle) = ellipse
    major_axis_length = max(axes)
    minor_axis_length = min(axes)
    axis_ratio = minor_axis_length / major_axis_length if major_axis_length != 0 else 0
    
    # Calculate eccentricity
    a = major_axis_length / 2  # semi-major axis
    b = minor_axis_length / 2  # semi-minor axis
    eccentricity = np.sqrt(1 - (b**2 / a**2)) if a != 0 else 0

    # Calculate ellipse variance
    ellipse_variance = calculate_ellipse_variance(points, ellipse)

    row['ellipse_angle'] = angle
    row['ellipse_axis_ratio'] = axis_ratio
    row['ellipse_eccentricity'] = eccentricity
    row['ellipse_variance'] = ellipse_variance

    end = process_time()
    print("Computation took ", end - start)

def calculate_ellipse_variance(points, ellipse):
    (center, axes, angle) = ellipse
    cx, cy = center
    rx, ry = axes[0] / 2, axes[1] / 2
    angle_rad = np.deg2rad(angle)
    cos_angle = np.cos(angle_rad)
    sin_angle = np.sin(angle_rad)

    def point_to_ellipse_distance(x, y):
        # Translate point to origin based on ellipse center
        xt = x - cx
        yt = y - cy
        
        # Rotate point coordinates
        xr = cos_angle * xt + sin_angle * yt
        yr = -sin_angle * xt + cos_angle * yt
        
        # Normalized radial distance in the ellipse frame (equals 1 for a point lying on the ellipse)
        part1 = (xr / rx) ** 2
        part2 = (yr / ry) ** 2
        return (part1 + part2) ** 0.5

    # Calculate distances from all points to the ellipse and compute their variance
    distances = [point_to_ellipse_distance(x, y) for (x, y) in points]
    variance = np.var(distances)
    return variance
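
The quantity returned by `point_to_ellipse_distance` is a normalized radius rather than a Euclidean distance: it is 1 on the ellipse, below 1 inside and above 1 outside. The sketch below checks this on points generated exactly on a rotated ellipse. The vectorized helper, its name and the ellipse parameters are assumptions.

```python
import numpy as np

def normalized_radial_distance(points, center, semi_axes, angle_deg):
    """sqrt((x'/rx)^2 + (y'/ry)^2) in the ellipse frame; equals 1 on the ellipse."""
    points = np.asarray(points, dtype=float)
    cx, cy = center
    rx, ry = semi_axes
    a = np.deg2rad(angle_deg)
    xt, yt = points[:, 0] - cx, points[:, 1] - cy  # translate to the ellipse centre
    xr = np.cos(a) * xt + np.sin(a) * yt           # rotate into the ellipse frame
    yr = -np.sin(a) * xt + np.cos(a) * yt
    return np.sqrt((xr / rx) ** 2 + (yr / ry) ** 2)

# Hypothetical ellipse: centre (10, 5), semi-axes 4 and 2, rotated by 30 degrees
t = np.linspace(0, 2 * np.pi, 50, endpoint=False)
phi = np.deg2rad(30)
x = 10 + 4 * np.cos(t) * np.cos(phi) - 2 * np.sin(t) * np.sin(phi)
y = 5 + 4 * np.cos(t) * np.sin(phi) + 2 * np.sin(t) * np.cos(phi)

d = normalized_radial_distance(np.column_stack([x, y]), (10, 5), (4, 2), 30)
print(d.min(), d.max(), np.var(d))
```

For points lying on the ellipse the variance is numerically zero, so a larger `ellipse_variance` indicates an outline that departs from the fitted ellipse.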


In [48]:
from skimage.morphology import skeletonize
import itertools
from scipy.spatial.distance import pdist

def is_likely_antenna(contour, length, i, elongation_threshold=6, length_threshold=50):
    ### UNUSED (DOESN'T RELIABLY DETECT ANTENNAE)
    ### INITIAL GOAL : DETERMINE IF A CONTOUR IS LIKELY AN ANTENNA BASED ON ITS ELONGATION AND LENGTH
    x, y, w, h = cv2.boundingRect(contour)
    elongation = max(w, h) / (min(w, h) if min(w, h) > 0 else 1)
    print('Dimension of bounding rect for contour {0}: [{1};{2}]'.format(i,w,h))
    print('Elongation : {0} | Length : {1}'.format(elongation,length))
    return (length > length_threshold and elongation > elongation_threshold), length

In [49]:
def caculate_body_parts_features(row,mask, blockSize, ksize, k, threshold_coef, plot=False, img=None, bug_type=None):
    ### COMPUTE THE FEATURES OF THE BODY PARTS OF THE INSECT USING THE MASK OF THE INSECT
    ### 1 : FINDING THE BODY PARTS : SKELETONIZE THE MASK, APPLY HARRIS CORNER DETECTION, FIND CONTOURS IN THE SKELETON
    ### 2 : EXTRACTING STATISTICS : MEAN LENGTH, MAX LENGTH, MIN LENGTH, STD LENGTH
    ### 3 : EXTRACTING THE SPREAD OF BODY PARTS : MEAN DISTANCE BETWEEN CENTROIDS OF BODY PARTS

    start = process_time()
    print("Starting computation of body parts features")

    # Skeletonize the mask
    skeleton = skeletonize(mask.astype(bool)).astype(np.uint8) * 255

    # Apply Harris corner detection
    harris_corners = cv2.cornerHarris(skeleton, blockSize, ksize, k)
    corners_threshold = harris_corners > threshold_coef * harris_corners.max()
    skeleton[corners_threshold] = 0  # Optional: remove corners for clearer paths

    # Find contours in the skeleton
    contours, _ = cv2.findContours(skeleton, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

    # Measure each contour (candidate body part): perimeter length and centroid
    body_part_lengths = []
    #antennas = []
    centroids = []
    for i, contour in enumerate(contours):
        length = cv2.arcLength(contour, True)
        #if is_likely_antenna(contour, length, i): antennas.append(contour)
        body_part_lengths.append(length)
        cx = int(np.mean(contour[:, :, 0]))
        cy = int(np.mean(contour[:, :, 1]))
        centroids.append([cx, cy])

    # Calculate statistics
    n = len(contours)
    if n == 0:
        print("Warning : could not detect any body part") 
        return
    else:
        row['body_parts_mean_length'] = np.mean(body_part_lengths)
        row['body_parts_max_length'] = np.max(body_part_lengths)
        row['body_parts_min_length']  = np.min(body_part_lengths)
        row['body_parts_std_length'] = np.std(body_part_lengths)

    # Calculate mean distance between centroids
    if len(centroids) > 1:
        centroid_array = np.array(centroids)
        distances = pdist(centroid_array, 'euclidean')
        mean_distance = np.mean(distances)
        row['body_parts_spread'] = mean_distance
    else:
        print("Warning : Only one body part detected")

    # Visualization and Results
    if plot:
        #num_antennas = len(antennas)
        #print("Number of antennas: {0}".format(num_body_parts))
        plt.figure(figsize=(10, 5))
        plt.subplot(121)
        extra = "" if bug_type is None else " (" + bug_type + ")"
        if img is None:
            plt.imshow(mask, cmap='gray')
            plt.title('Original Mask' + extra)
        else:
            plt.imshow(img)
            plt.title('Original Picture' + extra)
        plt.axis('off')
        
        plt.subplot(122)
        plt.imshow(skeleton, cmap='gray')
        y, x = np.where(corners_threshold) 
        plt.scatter(x, y, color='white', s=0.5, marker='x', label="Corner", alpha=0.5)  # mark corners
        #for i, antenna in enumerate(antennas):
        #    label="Antenna" if i == 0 else None
        #    plt.plot(antenna[:, 0, 0], antenna[:, 0, 1], 'r', linewidth=8, label=label, alpha=0.3)  # plot antennas
        for i, contour in enumerate(contours):
            plt.plot(contour[:, 0, 0], contour[:, 0, 1], linewidth=2, label="Contour " + str(i))

        plt.legend()
        plt.title('Skeleton with Body Part Contours (n_centroids={0})'.format(len(centroids)))
        plt.axis('off')

        plt.tight_layout()
        plt.show()
    end = process_time()
    print("Computation took ", end-start)
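
The `body_parts_spread` feature is the mean of all pairwise Euclidean distances between contour centroids. A minimal sketch of that quantity, written with NumPy only, is shown below; `mean_pairwise_distance` and the three centroids are hypothetical. It should give the same value as `np.mean(pdist(centroid_array, 'euclidean'))`, since `pdist` also returns each unordered pair once.

```python
import numpy as np

def mean_pairwise_distance(centroids):
    """Mean Euclidean distance over all unordered pairs of points."""
    pts = np.asarray(centroids, dtype=float)
    i, j = np.triu_indices(len(pts), k=1)        # each pair counted once
    return np.linalg.norm(pts[i] - pts[j], axis=1).mean()

# Three hypothetical body-part centroids forming a 3-4-5 right triangle
centroids = [[0, 0], [3, 0], [0, 4]]
spread = mean_pairwise_distance(centroids)       # (3 + 4 + 5) / 3
```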


In [50]:
def calculate_axis_least_inertia(row,bug_argwhere_indices, centroid):
    ### COMPUTE AN ORIENTATION DESCRIPTOR OF THE INSECT FROM THE COVARIANCE MATRIX OF ITS PIXEL COORDINATES
    ### AXIS OF LEAST INERTIA : THE LINE THROUGH THE CENTROID FOR WHICH THE SUM OF THE SQUARED DISTANCES TO THE SHAPE POINTS IS A MINIMUM
    ### NOTE : THAT LINE FOLLOWS THE EIGENVECTOR OF THE LARGEST EIGENVALUE; THIS FUNCTION STORES THE EIGENVECTOR OF THE SMALLEST EIGENVALUE, WHICH IS PERPENDICULAR TO IT
    ### THE STORED COMPONENTS FOLLOW THE (ROW, COLUMN) ORDER OF bug_argwhere_indices, AND THE SIGN OF AN EIGENVECTOR IS ARBITRARY

    start = process_time()
    print("Starting computation of axis of least inertia")
    # Shift points so centroid is at the origin
    shifted_points = bug_argwhere_indices - centroid
    
    # Compute covariance matrix of the shifted points
    covariance_matrix = np.cov(shifted_points, rowvar=False)
    
    # Perform eigenvalue decomposition
    eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
    
    # Find the eigenvector associated with the smallest eigenvalue
    min_eigenvalue_index = np.argmin(eigenvalues)
    axis_least_inertia = eigenvectors[:, min_eigenvalue_index]
    row['axis_least_inertia_x'] = axis_least_inertia[0]
    row['axis_least_inertia_y'] = axis_least_inertia[1]
    end = process_time()
    print("Computation took ", end - start)
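
To make the role of the two eigenvectors concrete, the sketch below builds a synthetic elongated point cloud (an assumption for illustration, not insect data) and compares both eigenvectors of its covariance matrix with the known direction of elongation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic cloud: large spread along the first coordinate, small along the second
points = rng.normal(size=(5000, 2)) * np.array([10.0, 1.0])
theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
points = points @ rotation.T                     # elongation now points along 30 degrees

centered = points - points.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

minor_direction = eigenvectors[:, np.argmin(eigenvalues)]  # what the function above stores
major_direction = eigenvectors[:, np.argmax(eigenvalues)]  # direction of elongation

elongation = np.array([np.cos(theta), np.sin(theta)])
alignment_major = abs(major_direction @ elongation)        # close to 1
alignment_minor = abs(minor_direction @ elongation)        # close to 0
```

Because an eigenvector is only defined up to its sign, the two stored components can flip sign from one image to the next; a sign-invariant encoding (for example the angle modulo 180 degrees) may be worth considering if that turns out to matter for the classifier.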


In [51]:
def get_mask_compactness(mask_area, mask_perimeter):
    ### COMPUTE THE COMPACTNESS OF THE INSECT MASK (INDICATOR OF HOW CLOSE THE SHAPE IS TO A CIRCLE)
    ### SHOULD BE EQUAL TO THE ROUNDNESS FEATURE - WE'LL CHECK LATER
    ### COMPACTNESS = 4 * PI * AREA / PERIMETER^2
    ### RETURNS THE COMPACTNESS
    
    if mask_perimeter == 0:  # Avoid division by zero
        return 0
    compactness = 4 * np.pi * mask_area / (mask_perimeter ** 2)
    return compactness
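
As a quick reference for reading compactness values, the sketch below evaluates the same formula on two ideal shapes. The function name `compactness` and the shape sizes are illustrative.

```python
import math

def compactness(area, perimeter):
    """4 * pi * area / perimeter**2, with 0 returned for a zero perimeter."""
    if perimeter == 0:
        return 0.0
    return 4 * math.pi * area / perimeter ** 2

r = 50.0
circle = compactness(math.pi * r ** 2, 2 * math.pi * r)    # ideal circle
side = 100.0
square = compactness(side ** 2, 4 * side)                   # ideal square
```

An ideal circle gives 1 and an ideal square gives π/4 (about 0.785), so values well below these indicate a long, irregular outline relative to the enclosed area.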

#### INITIALIZING THE COLUMNS FOR OUR FEATURES

In [52]:
print(len(im_dict), data.shape)
list(im_dict.keys()) == data.index.values.tolist()

97 (97, 137)


True

In [53]:
data = data.assign(nb_pixels_ratio=0, image_symmetry_index=0,mask_bb_symmetry_index=0, orthogonal_lines_ratio=0, roundness=0, 
                    mean_centroid_distance=0, std_centroid_distance=0, max_centroid_distance=0, min_centroid_distance=0,
                    skewness_centroid_distance=0, kurtosis_centroid_distance=0, aspect_ratio=0, mask_area=0, mask_perimeter=0, 
                    mask_compactness=0, hull_area=0, hull_to_insect_area_ratio=0, hull_convexity=0, 
                    hull_triangle_similarity=0, ellipse_angle=0, ellipse_axis_ratio=0, ellipse_eccentricity=0, ellipse_variance=0, 
                    axis_least_inertia_x=0, axis_least_inertia_y=0,rectangularity=0, body_parts_mean_length=0, body_parts_max_length=0,
                    body_parts_std_length=0, body_parts_spread=0, rest_entropy=0, mask_entropy=0)
colors = ['red', 'green', 'blue']
hsv = ['hue', 'saturation', 'value']
aggregates = ["min","max","mean","median","std","q1","q3"]
for color in colors + hsv:
    for aggregate in aggregates:
        data[color + "_mask_" + aggregate] = 0
        data[color + "_rest_" + aggregate] = 0

num_descriptors = 10
for i in range(num_descriptors):
    data[f'fourier_descriptor_real_{i}'] = 0
    data[f'fourier_descriptor_imag_{i}'] = 0
data = data.copy()

# Uncomment the next line to load previously computed features instead of recomputing them (e.g. to add a single new feature)
#data = pd.read_csv("data/processed_data.csv",header=0,index_col='ID')
for col in data.columns:
    if data[col].sum() == 0:
        print(col + ' is full of 0s')
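
The column layout created above can be cross-checked against the shape printed earlier, `(97, 137)`. The sketch below rebuilds the generated column names with list comprehensions and counts them; the count of 32 scalar features is read off the `data.assign(...)` call, and the variable names are illustrative.

```python
colors = ['red', 'green', 'blue']
hsv = ['hue', 'saturation', 'value']
aggregates = ["min", "max", "mean", "median", "std", "q1", "q3"]

# One "<channel>_mask_<aggregate>" and one "<channel>_rest_<aggregate>" column per pair
channel_columns = [f"{channel}_{region}_{aggregate}"
                   for channel in colors + hsv
                   for aggregate in aggregates
                   for region in ("mask", "rest")]

num_descriptors = 10
fourier_columns = [f"fourier_descriptor_{part}_{i}"
                   for i in range(num_descriptors)
                   for part in ("real", "imag")]

n_scalar_features = 32   # columns passed to data.assign(...) in the cell above
# +1 accounts for the bug_type label column
n_total = 1 + n_scalar_features + len(channel_columns) + len(fourier_columns)
```

Note that `body_parts_min_length`, which `caculate_body_parts_features` writes into the row, is not among these columns, and it is absent from the output tables as well.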

### MAIN FUNCTION TO COMPUTE ALL FEATURES FROM THE IMAGES AND MASKS IN TEST FOLDER

In [59]:
from scipy.spatial import ConvexHull


def compute_features(debug=True, plot_skeleton=False):
    #centroid_cols = [x for x in data.columns if "centroid" in x]

    #for index, row in data.iloc[:2].iterrows():
    for index, row in data.iterrows():
        print("Processing image ", index)
        num = index
        im_arr = im_dict[num]
        mask_arr = mask_dict[num]

        ## Step 1 : Preparation for feature computations (variable we will reuse)
        gray_im_arr = convert_to_grayscale(im_arr)
        mask_area = np.sum(mask_arr)
        is_mask = mask_arr == 1
        bug_indices = np.where(is_mask)
        non_bug_indices = np.where(~is_mask) #~mask_arr.astype(bool)
        bug_argwhere_indices = np.argwhere(is_mask)
        centroid = np.mean(bug_argwhere_indices, axis=0)
        edges = find_edges(mask_arr)
        edges_y, edges_x = np.where(edges == 1)
        
        # Extract points from the mask
        points = np.column_stack(bug_indices)
        
        if debug:
            print("Shape of the image array: ", im_arr.shape)
            print("Shape of the mask array: ", mask_arr.shape)
            print("Shape of bug points array :", bug_indices)
            #edges_opencv = find_edges_opencv(mask_arr)
            #edges_skimage = find_edges_skimage(mask_arr)
            #print("Shape of the opencv edges array: ", edges_opencv.shape)
            #print("Shape of the skimage edges array: ", edges_skimage.shape)
            #print(np.unique(edges_opencv), np.unique(edges_skimage))
            #print("Roundness is ",get_roundness(mask_area, edges))
            #print("Roundness of opencv is ",get_roundness(mask_area, edges_opencv))
            #print("Roundness of skimage is ", get_roundness(mask_area, edges_skimage))

        ## Step 2 : Bounding rectangle and skeleton features
        bb_minx, bb_maxx, bb_miny, bb_maxy = calculate_bounding_rectangle_features(row, mask_arr, mask_area)
        mask_arr_in_bb = mask_arr[bb_minx:bb_maxx+1, bb_miny:bb_maxy+1]
        row['orthogonal_lines_ratio'] = find_and_plot_orthogonal_lines(mask_arr_in_bb)
        if plot_skeleton:
            im_arr_in_bb = im_arr[bb_minx:bb_maxx+1, bb_miny:bb_maxy+1]
            caculate_body_parts_features(row,mask_arr_in_bb,20, 5, 0.2, 0.25, plot=plot_skeleton, img=im_arr_in_bb, bug_type=row.bug_type)
            continue
        else:
            caculate_body_parts_features(row,mask_arr_in_bb,20, 5, 0.2, 0.25)
        
        ## Step 3 : Default features (those from the PDF)
        row['nb_pixels_ratio'] = get_ratio(mask_area,gray_im_arr)
        for channel_id, channel_color in enumerate(colors):
            channel_arr = im_arr[:, :, channel_id]
            #get the color values only within the mask
            calculate_color_aggregates(row,channel_color + '_mask',channel_arr[bug_indices])
            calculate_color_aggregates(row,channel_color + '_rest', channel_arr[non_bug_indices])
        row['image_symmetry_index'] = get_symmetry_index(gray_im_arr)
        row['mask_bb_symmetry_index'] = get_symmetry_index(gray_im_arr[bb_minx:bb_maxx+1, bb_miny:bb_maxy+1])

        ## Step 4 : Additional features
        row['roundness'], mask_perimeter = get_roundness(mask_area, edges)
        #hsv_image = rgb_to_hsv(im_arr)
        #calculate_hsv_statistics(row, hsv_image, bug_indices, non_bug_indices)
        hsv_image2 = rgb_to_hsv_opencv(im_arr)
        #print(hsv_image2.shape,im_arr.shape,non_bug_indices[0].shape,non_bug_indices[1].shape)
        calculate_hsv_statistics(row, hsv_image2, bug_indices, non_bug_indices)
        
        row['rest_entropy'] = calculate_entropy(gray_im_arr[non_bug_indices])
        row['mask_entropy'] = calculate_entropy(gray_im_arr[bug_indices])
        row['mask_area'] = mask_area
        row['mask_perimeter'] = mask_perimeter
        row['mask_compactness'] =  get_mask_compactness(mask_area, mask_perimeter)
        edge_points = np.column_stack((edges_y, edges_x))
        calculate_geometric_centroid_distance_aggregates(row,mask_arr, bug_argwhere_indices, centroid, boundary_pixels=edge_points)
        #print(row[centroid_cols])
        #calculate_geometric_centroid_distance_aggregates(row,mask_arr, bug_argwhere_indices, centroid, plot='Manual')
        #print(row[centroid_cols])
        #row['fractal_dimension'] = calculate_fractal_dimension(mask_arr)
        calculate_fourier_descriptors(row,mask_arr, 10, edges_x, edges_y)
        calculate_axis_least_inertia(row,bug_argwhere_indices, centroid)
        calculate_ellipse_features(row, mask_arr, points)

        # Calculate convex hull properties
        hull = ConvexHull(points)
        calculate_convex_hull_properties(row, mask_arr, hull, mask_area, mask_perimeter)
        row['hull_triangle_similarity'] = get_triangle_similarity(hull, points, mask_area)

        ### Step 5 : Save the row
        data.loc[index]= row
    return data.head()
compute_features(plot_skeleton=False)
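
The color features in Step 3 rely on indexing each channel with the pixel coordinates inside and outside the mask. The helper `calculate_color_aggregates` lives in `tools` and is not shown here, so the sketch below is only an assumption of what such a computation may look like, using a hypothetical `channel_aggregates` function on a toy 4x4 image.

```python
import numpy as np

def channel_aggregates(values):
    """Seven summary statistics of a 1-D array of pixel intensities (hypothetical helper)."""
    return {
        "min": values.min(), "max": values.max(),
        "mean": values.mean(), "median": np.median(values),
        "std": values.std(),
        "q1": np.percentile(values, 25), "q3": np.percentile(values, 75),
    }

# Toy 4x4 single-channel image with a 2x2 "insect" in the middle
channel = np.arange(16, dtype=np.uint8).reshape(4, 4)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1

is_mask = mask == 1
bug_indices = np.where(is_mask)          # (rows, cols) of insect pixels
non_bug_indices = np.where(~is_mask)     # (rows, cols) of background pixels

mask_stats = channel_aggregates(channel[bug_indices])
rest_stats = channel_aggregates(channel[non_bug_indices])
```

Indexing with the tuple returned by `np.where` flattens the selected pixels into a 1-D array, which is why the same helper can serve both the mask and the background.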

Processing image  251
Time taken for finding edges: 0.03125
Shape of the image array:  (4000, 6000, 3)
Shape of the mask array:  (4000, 6000)
Shape of bug points array : (array([1257, 1258, 1259, ..., 2266, 2266, 2267], dtype=int64), array([2464, 2464, 2463, ..., 2364, 2365, 2364], dtype=int64))
Starting computation of bounding rectangle features
Computation took  0.0
Starting computation of body parts features
Computation took  0.25
Starting computation of symmetry index
Computation took  0.078125
Starting computation of symmetry index
Computation took  0.0
Starting conversion from RGB to HSV using OPENCV
The conversion took  0.015625
Starting computation of HSV statistics 
Computation took  0.890625
Starting computation of entropy
Computation took  0.3125
Starting computation of entropy
Computation took  0.0
Starting computation of geometric centroid distance aggregates
Computation took  0.0
Starting computation of Fourier descriptors
Computation took  0.015625
Starting computation o

Unnamed: 0,bug_type,nb_pixels_ratio,image_symmetry_index,mask_bb_symmetry_index,orthogonal_lines_ratio,roundness,mean_centroid_distance,std_centroid_distance,max_centroid_distance,min_centroid_distance,skewness_centroid_distance,kurtosis_centroid_distance,aspect_ratio,mask_area,mask_perimeter,mask_compactness,hull_area,hull_to_insect_area_ratio,hull_convexity,hull_triangle_similarity,ellipse_angle,ellipse_axis_ratio,ellipse_eccentricity,ellipse_variance,axis_least_inertia_x,axis_least_inertia_y,rectangularity,body_parts_mean_length,body_parts_max_length,body_parts_std_length,body_parts_spread,rest_entropy,mask_entropy,red_mask_min,red_rest_min,red_mask_max,red_rest_max,red_mask_mean,red_rest_mean,red_mask_median,red_rest_median,red_mask_std,red_rest_std,red_mask_q1,red_rest_q1,red_mask_q3,red_rest_q3,green_mask_min,green_rest_min,green_mask_max,green_rest_max,green_mask_mean,green_rest_mean,green_mask_median,green_rest_median,green_mask_std,green_rest_std,green_mask_q1,green_rest_q1,green_mask_q3,green_rest_q3,blue_mask_min,blue_rest_min,blue_mask_max,blue_rest_max,blue_mask_mean,blue_rest_mean,blue_mask_median,blue_rest_median,blue_mask_std,blue_rest_std,blue_mask_q1,blue_rest_q1,blue_mask_q3,blue_rest_q3,hue_mask_min,hue_rest_min,hue_mask_max,hue_rest_max,hue_mask_mean,hue_rest_mean,hue_mask_median,hue_rest_median,hue_mask_std,hue_rest_std,hue_mask_q1,hue_rest_q1,hue_mask_q3,hue_rest_q3,saturation_mask_min,saturation_rest_min,saturation_mask_max,saturation_rest_max,saturation_mask_mean,saturation_rest_mean,saturation_mask_median,saturation_rest_median,saturation_mask_std,saturation_rest_std,saturation_mask_q1,saturation_rest_q1,saturation_mask_q3,saturation_rest_q3,value_mask_min,value_rest_min,value_mask_max,value_rest_max,value_mask_mean,value_rest_mean,value_mask_median,value_rest_median,value_mask_std,value_rest_std,value_mask_q1,value_rest_q1,value_mask_q3,value_rest_q3,fourier_descriptor_real_0,fourier_descriptor_imag_0,fourier_descriptor_real_1,fourier_descriptor_imag_1,fourier_descriptor_real_2,fourier_descriptor_imag_2,fourier_descriptor_real_3,fourier_descriptor_imag_3,fourier_descriptor_real_4,fourier_descriptor_imag_4,fourier_descriptor_real_5,fourier_descriptor_imag_5,fourier_descriptor_real_6,fourier_descriptor_imag_6,fourier_descriptor_real_7,fourier_descriptor_imag_7,fourier_descriptor_real_8,fourier_descriptor_imag_8,fourier_descriptor_real_9,fourier_descriptor_imag_9
251,Bee,0.01257,34.23204,39.9125,0.95504,0.07598,363.48239,127.25453,712.9671,202.40252,0.78507,-0.20017,0.87438,301712,7064,0.07598,527273.5,0.57221,0.43324,0.70862,139.49823,0.78557,0.61878,0.10754,-0.77724,0.62921,0.33759,722.79782,2138.32418,757.01299,347.60211,6.84842,7.45157,5,12,255,255,112.18456,94.19645,108,93,47.17093,31.21034,81,75,141,109,0,5,255,252,100.73274,111.24017,104,116,47.44601,32.86262,63,94,130,133,0,10,255,255,72.33736,55.56671,61,47,42.76035,36.83415,42,36,99,60,0,0,179,179,28.17971,44.50369,19,41,32.10372,25.78548,14,38,30,42,0,0,255,222,102.4894,140.8289,103,147,41.54535,30.4978,71,133,133,160,6,26,255,255,114.22507,115.02239,112,118,47.10202,35.00454,82,96,142,135,0.824,0.56659,-0.07437,0.00186,-0.0239,-0.00102,-0.01529,-0.00136,-0.00884,0.00228,-0.01644,0.00043,-0.00245,-0.00314,-0.00706,0.00896,-0.01105,-0.00633,0.00069,0.00096
252,Bee,0.01566,28.47457,38.19924,0.41803,0.12835,352.87794,99.59866,521.37217,117.22277,-0.28527,-0.61249,0.83489,375834,6066,0.12835,465431.5,0.8075,0.43988,0.61088,123.54064,0.56102,0.8278,0.09263,0.50012,-0.86595,0.48542,511.28174,919.058,217.54884,329.96495,6.65313,7.43475,22,21,245,255,147.94102,88.31623,150,82,46.76533,31.83025,113,68,187,99,11,9,224,235,127.09688,89.62826,128,86,44.49762,26.93763,95,73,163,100,8,12,255,255,132.1523,88.16771,132,77,48.1988,46.79911,96,58,170,105,0,0,179,179,96.70796,63.74859,135,51,71.96495,42.53964,11,39,159,90,0,0,191,220,51.29142,59.68634,43,49,28.56344,36.60151,30,33,67,77,22,21,255,255,151.16131,101.61526,154,90,47.33611,40.67607,115,78,190,113,0.79752,0.60329,-0.03767,0.01904,-0.0282,0.00724,-0.01625,-0.00453,-0.00886,0.00425,-0.00996,0.00278,-0.00426,0.00406,-0.0027,0.00342,-0.00991,-0.00144,-0.00331,0.0013
253,Bee,0.01393,45.45621,32.55122,0.65286,0.08935,327.34657,91.43938,508.69088,128.43417,0.00274,-0.90335,0.78172,334305,6857,0.08935,455223.5,0.73438,0.37611,0.78761,1.89859,0.83032,0.55728,0.09801,-0.99423,-0.10724,0.49445,589.5436,1814.04999,531.97615,396.2756,7.25512,7.35176,10,13,255,255,115.8652,136.3177,122,126,47.18683,57.61107,82,95,147,165,4,0,255,247,84.88433,110.47203,87,110,42.12061,35.10449,52,91,112,131,0,0,255,255,64.08178,90.57136,61,81,37.6785,44.50656,34,59,86,113,0,0,179,179,20.55786,50.97725,12,30,35.58381,55.82293,9,14,15,43,0,0,255,255,119.96105,103.67877,119,102,43.28853,31.99164,91,84,148,121,10,24,255,255,115.98948,141.47719,122,130,47.18363,55.16421,82,105,147,167,0.78936,0.61394,-0.02043,0.01232,-0.01407,-0.00816,-0.00637,0.00269,-0.01185,-0.0102,-0.0053,-0.00549,-0.00347,0.0027,-0.00773,-0.00477,-0.0037,0.00252,-0.00508,-0.00334
254,Bee,0.01307,32.21554,64.29354,0.29609,0.06004,324.87634,106.94546,578.68909,96.34053,0.06806,-0.6116,0.91705,313608,8102,0.06004,515117.0,0.60881,0.34513,0.99453,3.38595,0.88196,0.47132,0.09923,-0.49483,-0.86899,0.45389,385.22782,1147.53317,334.06196,379.90891,6.35823,7.48906,32,14,255,255,171.20265,84.93624,176,74,55.25537,41.66422,126,64,220,86,4,0,255,253,130.79711,90.79172,130,88,47.20918,28.15106,99,76,171,96,16,0,255,255,121.21824,73.96998,118,62,41.90999,43.73244,94,53,150,74,0,0,179,179,72.05272,57.42459,14,47,76.05883,32.82692,10,45,163,51,0,0,229,255,82.78141,73.41247,79,75,36.58201,24.50431,58,62,105,86,32,14,255,255,171.77708,97.15832,176,88,54.84206,40.16134,127,77,220,98,0.82709,0.56206,-0.05501,0.00999,-0.01465,0.00194,-0.01767,-0.00089,-0.01223,0.00433,-0.00986,-0.00065,-0.00922,0.00289,-0.00432,0.00194,-0.00719,-0.00231,-0.00776,0.00225
255,Bee,0.01447,26.49725,48.01838,0.54083,0.09671,360.99218,90.90101,537.58337,188.61799,-0.08146,-1.15878,0.68367,347343,6718,0.09671,485418.5,0.71555,0.40564,0.66486,90.90733,0.60589,0.79555,0.09319,0.03246,-0.99947,0.65309,1502.11927,5822.25167,2494.23646,387.66667,6.60365,7.48284,10,20,255,255,127.44884,107.81115,128,105,47.58578,29.11961,94,89,163,121,0,1,255,239,98.10065,117.8465,99,118,44.68724,23.67047,65,103,129,133,0,8,255,255,97.33894,93.85242,97,83,46.72134,44.24134,62,64,130,109,0,0,179,179,69.43894,59.96029,13,44,72.78692,35.85761,8,42,148,47,0,0,255,248,85.7505,87.80545,81,89,43.8215,31.35753,53,68,114,111,10,30,255,255,130.01306,127.20454,132,123,47.17363,32.10992,98,108,164,139,0.8436,0.53697,-0.05266,0.008,-0.01669,-0.00197,-0.01223,0.00361,-0.00947,-0.002,-0.00968,0.00092,-0.00946,-0.00078,-0.00401,4e-05,-0.00761,-0.00092,-0.00696,-0.00079


## B - Check coherence of results

We can now check that the computed features are coherent, i.e. that their values are not too far from those of the training data. An issue in the test set could heavily hinder the performance of our algorithm. To that end, we build a table of summary statistics for each feature in the train and test data.
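
One possible way to turn this comparison into an automatic check is to express, for each feature, the gap between the test mean and the train mean in units of the train standard deviation. The sketch below is an assumption about how this could be done, not the method used in this notebook; `flag_shifted_features`, the threshold of 0.5 and the synthetic frames are all hypothetical.

```python
import numpy as np
import pandas as pd

def flag_shifted_features(test_df, train_df, threshold=0.5):
    """Features whose test mean lies far from the train mean, in train-std units."""
    numeric = train_df.select_dtypes(include="number").columns.intersection(test_df.columns)
    train_mean = train_df[numeric].mean()
    train_std = train_df[numeric].std().replace(0, np.nan)   # avoid division by zero
    shift = (test_df[numeric].mean() - train_mean) / train_std
    flagged = shift[shift.abs() > threshold]
    # Largest absolute shift first
    return flagged.reindex(flagged.abs().sort_values(ascending=False).index)

# Synthetic example: feature "b" is deliberately shifted in the test set
rng = np.random.default_rng(0)
train_demo = pd.DataFrame({"a": rng.normal(0, 1, 500), "b": rng.normal(0, 1, 500)})
test_demo = pd.DataFrame({"a": rng.normal(0, 1, 200), "b": rng.normal(2, 1, 200)})
flagged = flag_shifted_features(test_demo, train_demo)
```

With the notebook's own frames this would be called as `flag_shifted_features(data, train_data)` once both are loaded; the threshold would need tuning to the sample sizes at hand.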

In [61]:
get_missing(data)

Unnamed: 0,Column name,Percentage of NA
bug_type,bug_type,0.0
saturation_mask_q1,saturation_mask_q1,0.0
saturation_rest_std,saturation_rest_std,0.0
saturation_mask_std,saturation_mask_std,0.0
saturation_rest_median,saturation_rest_median,0.0
saturation_mask_median,saturation_mask_median,0.0
saturation_rest_mean,saturation_rest_mean,0.0
saturation_mask_mean,saturation_mask_mean,0.0
saturation_rest_max,saturation_rest_max,0.0
saturation_mask_max,saturation_mask_max,0.0
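
`get_missing` is imported from `tools` and its source is not shown here. Judging only from the layout of its output, it reports the percentage of missing values per column; the sketch below is a guess at an equivalent helper, with the hypothetical name `missing_percentage` and English column labels.

```python
import numpy as np
import pandas as pd

def missing_percentage(df):
    """Percentage of missing values per column, largest first (hypothetical helper)."""
    percent = df.isna().mean() * 100
    report = pd.DataFrame({"column": df.columns, "percent_na": percent.values},
                          index=df.columns)
    return report.sort_values("percent_na", ascending=False)

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})
report = missing_percentage(demo)
```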


In [62]:
data

Unnamed: 0,bug_type,nb_pixels_ratio,image_symmetry_index,mask_bb_symmetry_index,orthogonal_lines_ratio,roundness,mean_centroid_distance,std_centroid_distance,max_centroid_distance,min_centroid_distance,skewness_centroid_distance,kurtosis_centroid_distance,aspect_ratio,mask_area,mask_perimeter,mask_compactness,hull_area,hull_to_insect_area_ratio,hull_convexity,hull_triangle_similarity,ellipse_angle,ellipse_axis_ratio,ellipse_eccentricity,ellipse_variance,axis_least_inertia_x,axis_least_inertia_y,rectangularity,body_parts_mean_length,body_parts_max_length,body_parts_std_length,body_parts_spread,rest_entropy,mask_entropy,red_mask_min,red_rest_min,red_mask_max,red_rest_max,red_mask_mean,red_rest_mean,red_mask_median,red_rest_median,red_mask_std,red_rest_std,red_mask_q1,red_rest_q1,red_mask_q3,red_rest_q3,green_mask_min,green_rest_min,green_mask_max,green_rest_max,green_mask_mean,green_rest_mean,green_mask_median,green_rest_median,green_mask_std,green_rest_std,green_mask_q1,green_rest_q1,green_mask_q3,green_rest_q3,blue_mask_min,blue_rest_min,blue_mask_max,blue_rest_max,blue_mask_mean,blue_rest_mean,blue_mask_median,blue_rest_median,blue_mask_std,blue_rest_std,blue_mask_q1,blue_rest_q1,blue_mask_q3,blue_rest_q3,hue_mask_min,hue_rest_min,hue_mask_max,hue_rest_max,hue_mask_mean,hue_rest_mean,hue_mask_median,hue_rest_median,hue_mask_std,hue_rest_std,hue_mask_q1,hue_rest_q1,hue_mask_q3,hue_rest_q3,saturation_mask_min,saturation_rest_min,saturation_mask_max,saturation_rest_max,saturation_mask_mean,saturation_rest_mean,saturation_mask_median,saturation_rest_median,saturation_mask_std,saturation_rest_std,saturation_mask_q1,saturation_rest_q1,saturation_mask_q3,saturation_rest_q3,value_mask_min,value_rest_min,value_mask_max,value_rest_max,value_mask_mean,value_rest_mean,value_mask_median,value_rest_median,value_mask_std,value_rest_std,value_mask_q1,value_rest_q1,value_mask_q3,value_rest_q3,fourier_descriptor_real_0,fourier_descriptor_imag_0,fourier_descriptor_real_1,fourier_descriptor_imag_1,fourier_descriptor_real_2,fourier_descriptor_imag_2,fourier_descriptor_real_3,fourier_descriptor_imag_3,fourier_descriptor_real_4,fourier_descriptor_imag_4,fourier_descriptor_real_5,fourier_descriptor_imag_5,fourier_descriptor_real_6,fourier_descriptor_imag_6,fourier_descriptor_real_7,fourier_descriptor_imag_7,fourier_descriptor_real_8,fourier_descriptor_imag_8,fourier_descriptor_real_9,fourier_descriptor_imag_9
251,Bee,0.01257,34.23204,39.9125,0.95504,0.07598,363.48239,127.25453,712.9671,202.40252,0.78507,-0.20017,0.87438,301712,7064,0.07598,527273.5,0.57221,0.43324,0.70862,139.49823,0.78557,0.61878,0.10754,-0.77724,0.62921,0.33759,722.79782,2138.32418,757.01299,347.60211,6.84842,7.45157,5,12,255,255,112.18456,94.19645,108,93,47.17093,31.21034,81,75,141,109,0,5,255,252,100.73274,111.24017,104,116,47.44601,32.86262,63,94,130,133,0,10,255,255,72.33736,55.56671,61,47,42.76035,36.83415,42,36,99,60,0,0,179,179,28.17971,44.50369,19,41,32.10372,25.78548,14,38,30,42,0,0,255,222,102.4894,140.8289,103,147,41.54535,30.4978,71,133,133,160,6,26,255,255,114.22507,115.02239,112,118,47.10202,35.00454,82,96,142,135,0.824,0.56659,-0.07437,0.00186,-0.0239,-0.00102,-0.01529,-0.00136,-0.00884,0.00228,-0.01644,0.00043,-0.00245,-0.00314,-0.00706,0.00896,-0.01105,-0.00633,0.00069,0.00096
252,Bee,0.01566,28.47457,38.19924,0.41803,0.12835,352.87794,99.59866,521.37217,117.22277,-0.28527,-0.61249,0.83489,375834,6066,0.12835,465431.5,0.8075,0.43988,0.61088,123.54064,0.56102,0.8278,0.09263,0.50012,-0.86595,0.48542,511.28174,919.058,217.54884,329.96495,6.65313,7.43475,22,21,245,255,147.94102,88.31623,150,82,46.76533,31.83025,113,68,187,99,11,9,224,235,127.09688,89.62826,128,86,44.49762,26.93763,95,73,163,100,8,12,255,255,132.1523,88.16771,132,77,48.1988,46.79911,96,58,170,105,0,0,179,179,96.70796,63.74859,135,51,71.96495,42.53964,11,39,159,90,0,0,191,220,51.29142,59.68634,43,49,28.56344,36.60151,30,33,67,77,22,21,255,255,151.16131,101.61526,154,90,47.33611,40.67607,115,78,190,113,0.79752,0.60329,-0.03767,0.01904,-0.0282,0.00724,-0.01625,-0.00453,-0.00886,0.00425,-0.00996,0.00278,-0.00426,0.00406,-0.0027,0.00342,-0.00991,-0.00144,-0.00331,0.0013
253,Bee,0.01393,45.45621,32.55122,0.65286,0.08935,327.34657,91.43938,508.69088,128.43417,0.00274,-0.90335,0.78172,334305,6857,0.08935,455223.5,0.73438,0.37611,0.78761,1.89859,0.83032,0.55728,0.09801,-0.99423,-0.10724,0.49445,589.5436,1814.04999,531.97615,396.2756,7.25512,7.35176,10,13,255,255,115.8652,136.3177,122,126,47.18683,57.61107,82,95,147,165,4,0,255,247,84.88433,110.47203,87,110,42.12061,35.10449,52,91,112,131,0,0,255,255,64.08178,90.57136,61,81,37.6785,44.50656,34,59,86,113,0,0,179,179,20.55786,50.97725,12,30,35.58381,55.82293,9,14,15,43,0,0,255,255,119.96105,103.67877,119,102,43.28853,31.99164,91,84,148,121,10,24,255,255,115.98948,141.47719,122,130,47.18363,55.16421,82,105,147,167,0.78936,0.61394,-0.02043,0.01232,-0.01407,-0.00816,-0.00637,0.00269,-0.01185,-0.0102,-0.0053,-0.00549,-0.00347,0.0027,-0.00773,-0.00477,-0.0037,0.00252,-0.00508,-0.00334
254,Bee,0.01307,32.21554,64.29354,0.29609,0.06004,324.87634,106.94546,578.68909,96.34053,0.06806,-0.6116,0.91705,313608,8102,0.06004,515117.0,0.60881,0.34513,0.99453,3.38595,0.88196,0.47132,0.09923,-0.49483,-0.86899,0.45389,385.22782,1147.53317,334.06196,379.90891,6.35823,7.48906,32,14,255,255,171.20265,84.93624,176,74,55.25537,41.66422,126,64,220,86,4,0,255,253,130.79711,90.79172,130,88,47.20918,28.15106,99,76,171,96,16,0,255,255,121.21824,73.96998,118,62,41.90999,43.73244,94,53,150,74,0,0,179,179,72.05272,57.42459,14,47,76.05883,32.82692,10,45,163,51,0,0,229,255,82.78141,73.41247,79,75,36.58201,24.50431,58,62,105,86,32,14,255,255,171.77708,97.15832,176,88,54.84206,40.16134,127,77,220,98,0.82709,0.56206,-0.05501,0.00999,-0.01465,0.00194,-0.01767,-0.00089,-0.01223,0.00433,-0.00986,-0.00065,-0.00922,0.00289,-0.00432,0.00194,-0.00719,-0.00231,-0.00776,0.00225
255,Bee,0.01447,26.49725,48.01838,0.54083,0.09671,360.99218,90.90101,537.58337,188.61799,-0.08146,-1.15878,0.68367,347343,6718,0.09671,485418.5,0.71555,0.40564,0.66486,90.90733,0.60589,0.79555,0.09319,0.03246,-0.99947,0.65309,1502.11927,5822.25167,2494.23646,387.66667,6.60365,7.48284,10,20,255,255,127.44884,107.81115,128,105,47.58578,29.11961,94,89,163,121,0,1,255,239,98.10065,117.8465,99,118,44.68724,23.67047,65,103,129,133,0,8,255,255,97.33894,93.85242,97,83,46.72134,44.24134,62,64,130,109,0,0,179,179,69.43894,59.96029,13,44,72.78692,35.85761,8,42,148,47,0,0,255,248,85.7505,87.80545,81,89,43.8215,31.35753,53,68,114,111,10,30,255,255,130.01306,127.20454,132,123,47.17363,32.10992,98,108,164,139,0.8436,0.53697,-0.05266,0.008,-0.01669,-0.00197,-0.01223,0.00361,-0.00947,-0.002,-0.00968,0.00092,-0.00946,-0.00078,-0.00401,4e-05,-0.00761,-0.00092,-0.00696,-0.00079
256,Bee,0.02396,34.75761,68.92763,0.6987,0.0757,453.07702,119.7612,761.75258,216.00148,0.19527,-0.47171,0.77198,575110,9771,0.0757,871701.5,0.65976,0.37177,0.79994,154.87961,0.72624,0.68745,0.09829,-0.92744,0.37398,0.41366,1069.71396,3178.6673,1038.84343,492.1596,6.44331,7.28302,0,8,255,255,85.78969,139.00471,83,141,50.56077,31.66097,44,136,119,146,0,0,255,255,60.88877,105.11402,51,101,42.5179,34.64504,29,95,83,127,0,0,255,255,58.01538,109.01675,44,96,47.84032,46.33057,23,90,80,125,0,0,179,179,74.01549,43.99058,18,9,72.76999,62.11653,11,4,158,32,0,0,255,255,106.03899,82.25674,105,83,55.40819,34.43433,65,64,145,92,0,8,255,255,86.7981,141.90526,83,141,50.92301,36.60321,45,137,119,147,0.85494,0.51873,-0.04017,0.00797,-0.02312,0.01043,-0.02496,0.00604,-0.02078,-0.00121,-0.00356,-0.00245,-0.00132,-0.00128,-0.00646,-0.00016,-0.00238,0.00816,-0.00856,0.00173
257,Bee,0.00804,43.46353,60.98418,0.42319,0.04843,244.61428,84.12521,446.83278,83.58523,0.07694,-0.82956,0.83229,192912,7075,0.04843,345865.0,0.55777,0.32087,0.99076,80.01601,0.89581,0.44444,0.09896,-0.42556,-0.90493,0.36307,518.61702,1457.2834,435.18104,283.56122,7.47934,7.57485,6,6,255,255,115.91941,97.1281,111,93,57.31766,48.12512,70,56,163,129,0,10,254,255,93.47043,115.53764,88,114,50.67062,47.4054,51,75,129,152,0,1,255,255,67.24771,67.59302,52,56,49.25738,48.39743,30,33,95,85,0,0,179,179,25.63662,46.17302,15,43,34.35256,20.49681,12,40,23,46,0,0,255,250,116.01113,118.3394,117,131,51.49222,49.31403,79,99,154,152,8,10,255,255,117.41781,116.75996,113,115,56.41185,47.90617,74,76,163,153,0.76426,0.6449,-0.03331,0.00352,-0.01129,0.00193,-0.00823,0.00332,-0.00985,-0.00477,-0.00483,-0.00094,-0.00186,-0.00217,-0.00407,0.00156,-0.00854,-0.00413,-0.00078,-0.00181
258,Bee,0.01769,39.98478,49.60832,0.68025,0.06365,408.71703,144.01844,718.18485,122.3917,0.09562,-0.92755,0.85113,424512,9155,0.06365,744687.5,0.57005,0.38416,0.83974,28.0758,0.56883,0.82245,0.08287,-0.96334,-0.26829,0.32648,713.76588,3612.72324,1042.42259,384.61333,7.23526,7.02287,4,8,255,255,86.94146,119.59266,94,113,39.86481,41.78623,55,94,112,145,1,0,255,255,71.24708,102.25729,67,107,41.78582,39.47718,33,82,106,126,0,0,255,255,51.05969,95.94661,49,81,32.68891,54.70787,23,57,70,126,0,0,179,179,28.42781,74.32776,17,35,40.03343,60.52323,10,28,27,141,0,0,255,255,114.20386,102.66369,108,104,41.03282,35.51962,86,81,143,125,6,8,255,255,87.15505,126.09137,94,119,39.85728,44.561,55,100,112,150,0.82675,0.56257,-0.04096,0.00766,-0.00801,0.00667,-0.00485,-0.03297,-0.01712,-0.00544,-0.01448,-0.00749,-0.00736,-0.00535,-0.00712,0.00235,-0.00209,-0.00956,-0.00197,-0.00277
259,Bee,0.01954,42.53416,43.63771,0.78339,0.07089,416.02833,116.01024,671.79946,186.819,0.21878,-0.96461,0.9341,468873,9117,0.07089,725476.0,0.6463,0.37432,0.98719,1.5224,0.93705,0.34919,0.10029,-0.56551,-0.82474,0.34058,981.32528,3108.79934,928.28769,375.9117,6.97083,7.36567,11,13,255,255,128.59411,99.81394,122,94,46.45078,38.36116,102,74,150,113,0,0,255,255,91.89732,103.16564,92,101,44.50965,34.79646,60,81,112,122,0,0,255,255,78.04001,67.69443,77,57,50.32067,41.79074,35,45,106,73,0,0,179,179,41.27673,38.75938,11,35,63.20975,29.23437,8,29,17,40,0,0,255,255,113.11562,105.46261,106,108,63.99561,35.4623,59,87,167,130,11,13,255,255,128.82359,108.86495,122,103,46.62601,38.2624,102,85,150,124,0.76956,0.63857,-0.02531,-0.01219,-0.03268,0.00841,-0.00448,0.00083,-0.01,-0.00779,-0.01187,0.00742,-0.00789,-0.005,-0.00954,0.00374,-0.00529,-0.00496,-0.0036,-0.00352
260,Bee,0.01145,40.48043,45.48228,0.91969,0.06687,306.08773,88.56946,488.8654,108.78833,-0.24516,-0.55171,0.89163,274704,7185,0.06687,411988.5,0.66678,0.35057,0.96461,178.83757,0.91702,0.39884,0.10121,0.48031,-0.8771,0.46727,594.21167,2173.28461,646.14207,282.47386,7.18316,6.91379,7,3,255,255,69.99293,103.69377,63,100,43.42533,41.08399,34,75,100,126,0,3,241,255,54.80465,111.85303,38,111,40.66118,35.61511,22,86,87,137,0,0,255,255,41.01431,69.47615,29,58,32.59493,40.38885,18,41,56,87,0,0,179,179,33.78986,44.6743,15,39,49.86199,31.08204,8,34,27,42,0,0,255,255,114.2991,109.95293,113,109,39.34569,39.47779,87,81,138,137,7,4,255,255,70.61137,115.84487,63,113,43.46274,39.19605,34,87,101,141,0.69545,0.71857,-0.04222,-0.00826,-0.00828,0.00353,-0.00284,0.00607,-0.00323,0.00817,-0.00702,0.0097,-0.00775,0.00408,-0.00717,0.00148,-0.00662,-0.00135,-0.00498,-0.0021


In [None]:
train_data = pd.read_csv('data/processed_data.csv',header=0,index_col='ID')

In [None]:
# Note: the mean of skewness_centroid_distance differs between data (0.02713) and train_data (0.06733)

In [67]:
# Place the describe() statistics of data (test set) and train_data side by side, one row per feature
pd.concat([data.describe().T, train_data.describe().T], axis=1, keys=['data', 'train_data'])

Unnamed: 0_level_0,data,data,data,data,data,data,data,data,train_data,train_data,train_data,train_data,train_data,train_data,train_data,train_data
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
nb_pixels_ratio,97.0,0.02194,0.01539,0.00559,0.0149,0.01876,0.02509,0.143836,249.0,0.02291,0.01686,0.00362,0.01338,0.02005,0.02838,0.187458
image_symmetry_index,97.0,39.71137,11.11911,17.8361,32.21554,40.5273,45.45621,70.5274,249.0,41.20895,12.44419,17.35951,32.94044,39.67906,47.68733,84.9593
mask_bb_symmetry_index,97.0,50.07886,11.71375,22.09118,42.02613,49.27934,57.15405,78.4151,249.0,50.60318,12.01186,18.1041,42.67664,49.8213,57.85035,86.7271
orthogonal_lines_ratio,97.0,0.66723,0.18233,0.29233,0.52488,0.67015,0.80454,1.0,249.0,0.65815,0.18643,0.23966,0.50415,0.6578,0.79267,1.0
roundness,97.0,0.09336,0.03337,0.02337,0.07089,0.08683,0.11314,0.209421,249.0,0.07757,0.0331,0.02198,0.05362,0.07151,0.09212,0.229485
mean_centroid_distance,97.0,419.71483,117.92875,230.63931,352.87794,408.71703,462.14663,1112.07,249.0,424.99955,142.7636,151.36897,330.8186,409.42561,488.72257,1258.71
std_centroid_distance,97.0,119.56817,36.19424,40.58083,99.52756,116.01024,138.77188,280.745,249.0,120.97559,51.77786,27.40385,89.80869,112.4572,138.3341,419.038
max_centroid_distance,97.0,687.11171,192.23473,316.25696,574.53226,662.37478,762.14499,1692.97,249.0,694.67362,250.81663,259.8489,541.90994,666.49125,782.52494,2024.02
min_centroid_distance,97.0,171.09102,90.5323,10.57534,117.22277,167.0637,205.21316,680.329,249.0,174.95151,91.27019,2.14227,109.50865,166.91921,229.02068,669.67
skewness_centroid_distance,97.0,0.02713,0.30987,-1.04323,-0.17487,0.01986,0.21878,0.785066,249.0,0.06733,0.29317,-0.8492,-0.11056,0.06842,0.26124,1.12185


This extract of the comparison table shows no alarming difference between the test and training feature distributions, so we can save the test features and run them through the models we fitted on the training set.
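
The side-by-side reading of the table can be complemented with a numeric summary. The sketch below is not part of the original pipeline: it computes a standardized mean difference per feature (difference of means divided by a pooled standard deviation) so that the features whose test and training means diverge most come first. The function name `standardized_mean_diff` and the toy frames are hypothetical; what counts as a large value is a judgment call.

```python
import numpy as np
import pandas as pd


def standardized_mean_diff(test_df, train_df):
    """Per-feature (mean_test - mean_train) / pooled std, sorted by absolute value."""
    # Compare only the numeric columns present in both frames
    cols = test_df.select_dtypes("number").columns.intersection(
        train_df.select_dtypes("number").columns
    )
    diff = test_df[cols].mean() - train_df[cols].mean()
    # Pooled standard deviation: square root of the average of the two variances
    pooled_std = np.sqrt((test_df[cols].var() + train_df[cols].var()) / 2)
    smd = diff / pooled_std
    # Largest absolute differences first
    return smd.reindex(smd.abs().sort_values(ascending=False).index)


# Toy frames standing in for `data` and `train_data` (illustrative values only)
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"a": rng.normal(0, 1, 200), "b": rng.normal(5, 2, 200)})
toy_test = pd.DataFrame({"a": rng.normal(0, 1, 100), "b": rng.normal(7, 2, 100)})

smd = standardized_mean_diff(toy_test, toy_train)
print(smd)
```

In the notebook, `standardized_mean_diff(data, train_data).head(10)` would list the ten features with the largest shift, assuming both frames share the same feature column names.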


In [65]:
data.to_csv("data/processed_test_data.csv")
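
The training file is read with `index_col='ID'`, whereas the test index shown in the outputs above has no name. Assuming the index is still unnamed at this point, passing `index_label` when saving would let both files be loaded the same way. The sketch below illustrates this on a toy frame and an in-memory buffer instead of the real file; the two rows reuse values from the output above, and the label `"ID"` is an assumption taken from the training file.

```python
import io

import pandas as pd

# Toy frame standing in for `data`: unnamed integer index, two of its columns
toy = pd.DataFrame(
    {"bug_type": ["Bee", "Bee"], "nb_pixels_ratio": [0.01769, 0.01954]},
    index=[258, 259],
)

# Write to an in-memory buffer, naming the index column "ID" on the way out
buffer = io.StringIO()
toy.to_csv(buffer, index_label="ID")

# Read it back the same way the training file is read
buffer.seek(0)
reloaded = pd.read_csv(buffer, header=0, index_col="ID")
print(reloaded)
```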