# Analysis of Gradient Boosting Results + Complexity Measures per Level of Complexity

We now have a total of 112 datasets. We split them according to their level of complexity and study the results within each category. The objective is to investigate whether, for some levels of complexity (for example, the hardest datasets), we obtain better results than classic boosting.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob
import math


# Move one level up; the parent directory is used as the project root
os.chdir("..")
root_path = os.getcwd()

In [2]:
path_csv = os.path.join(root_path, 'Results_GB')
os.chdir(path_csv)

In [3]:
#colour_palette_personalized = ["#FFD700", "#00CED1", "#FF1493","#F1F1F1"]
colour_palette_personalized = {
    "classic": "#FFD700",                # gold
    "sample_weight_easy": "#C7F7FF",     # light cyan
    "sample_weight_easy_x2": "#00CED1",  # dark turquoise
    "sample_weight_hard": "#FFB3DA",     # light pink
    "sample_weight_hard_x2": "#FF1493",  # deep pink
    "Classic": "#FFD700",                # gold
    "Easy": "#00CED1",                   # dark turquoise
    "Hard": "#FF1493",                   # deep pink
}



In [4]:
specific_path = os.path.join(path_csv, '*Aggregated*.csv')
selected_files = glob.glob(specific_path)
all_datasets = pd.concat([pd.read_csv(f) for f in selected_files], ignore_index=True)

We already have the complexity characteristics of each dataset in the CSV file complex_info_dataset_20250115.csv. We read it and split the datasets according to their values for each complexity measure.

In [8]:
path_complex = os.path.join(root_path, 'datasets/complexity_info')
os.chdir(path_complex)
df_complex = pd.read_csv('complex_info_dataset_20250115.csv')
df_complex.head()

| Unnamed: 0 | dataset | Hostility | kDN | DS | DCP | TD_U | TD_P | MV | CB | CLD | ... | N2 | LSC | LSradius | H | U | F1 | F2 | F3 | F4 | dataset.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | analcatdata_gviolence | 0.22973 | 0.083784 | 0.219595 | 0.050778 | 0.489865 | 1.0 | 0.116908 | 0.657901 | 0.18785 | ... | 0.249374 | 0.589575 | 0.608601 | 0.027027 | 0.600292 | 0.405405 | 0.0 | 0.612981 | 0.91109 | analcatdata_gviolence |
| 1 | analcatdata_japansolvent | 0.25 | 0.265385 | 0.384615 | 0.0 | 0.461538 | 0.461538 | 0.035613 | 0.666174 | 0.336514 | ... | 0.400943 | 0.768121 | 0.791536 | 0.038462 | 0.776809 | 0.766827 | 0.538542 | 0.75722 | 0.916458 | analcatdata_japansolvent |
| 2 | analcatdata_vineyard | 0.15812 | 0.188462 | 0.605078 | 0.493827 | 0.500986 | NaN | 0.088889 | 0.662551 | 0.418386 | ... | 0.245928 | 0.975808 | 0.948771 | 0.004274 | 0.975912 | 0.849003 | 0.736191 | 0.799548 | 0.862728 | analcatdata_vineyard |
| 3 | arrhythmia_cfs | 0.205752 | 0.316372 | 0.573684 | 0.177085 | 0.572607 | 0.692232 | 0.071031 | 0.664311 | 0.331718 | ... | 0.471772 | 0.981821 | 0.715307 | 0.004425 | 0.981899 | 0.745515 | 0.603953 | 0.801924 | 0.989752 | arrhythmia_cfs |
| 4 | Australian | 0.175362 | 0.185507 | 0.615459 | 0.211243 | 0.568237 | 0.761594 | 0.088288 | 0.662623 | 0.22064 | ... | 0.398111 | 0.969318 | 0.799725 | 0.002899 | 0.969406 | 0.519876 | 0.664697 | 0.757345 | 0.995492 | Australian |


In [9]:
list_CM = ['Hostility','kDN','DCP','TD_U','CLD','N1','N2','LSC','F1']

For this analysis, the plan is:
 * Split the range of each complexity measure into three levels: easy, medium and hard. I will do this automatically with a Python function, because for some measures I do not know how to interpret the values (for example F1, whose range is [0,1] but whose values are heavily concentrated around 0.9).
 * For each complexity measure, study the results within these levels. I will look at the mean, median and std of accuracy, and at the WTL (win/tie/loss) counts.
 * Then do the study from the other perspective: divide the datasets according to whether, with that complexity measure, I win, tie or lose. I create these categories and carry out an exploratory analysis of them, for example plotting the complexity of each category with boxplots.
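
The three steps above can be sketched on toy data. This is only an illustration under stated assumptions, not the function used later in the notebook. It assumes a wide table with one row per dataset and one accuracy column per method, and that higher values of the measure mean harder datasets. The helper name split_into_levels, the tolerance tol and all numbers in toy are made up. Equal-frequency bins (terciles) are one possible automatic split; they avoid empty levels when a measure is concentrated in a narrow range, as described for F1.

```python
import numpy as np
import pandas as pd


def split_into_levels(values, labels=("easy", "medium", "hard")):
    """Bin a complexity measure into equal-frequency levels (terciles by default).

    Assumes higher values mean harder datasets (hypothetical helper).
    """
    # Ranking first breaks ties, so qcut cannot fail on duplicated bin edges.
    ranks = values.rank(method="first")
    return pd.qcut(ranks, q=len(labels), labels=list(labels))


# Toy data, made up for illustration: one row per dataset,
# one accuracy column per method.
toy = pd.DataFrame({
    "dataset": list("abcdef"),
    "Hostility": [0.05, 0.10, 0.20, 0.25, 0.40, 0.45],
    "classic": [0.95, 0.90, 0.80, 0.82, 0.70, 0.65],
    "sample_weight_hard": [0.95, 0.89, 0.83, 0.82, 0.74, 0.66],
})

# Step 1: easy / medium / hard level for the chosen complexity measure.
toy["level"] = split_into_levels(toy["Hostility"])

# Step 2a: mean, median and std of accuracy within each level.
accuracy_summary = (
    toy.groupby("level", observed=True)["sample_weight_hard"]
    .agg(["mean", "median", "std"])
)

# Step 2b: win / tie / loss of the weighted method against classic boosting.
tol = 1e-9  # differences smaller than this count as a tie (assumed threshold)
diff = toy["sample_weight_hard"] - toy["classic"]
toy["outcome"] = np.select([diff > tol, diff < -tol], ["win", "loss"], default="tie")
wtl_per_level = pd.crosstab(toy["level"], toy["outcome"])

# Step 3: the other perspective, complexity of the datasets in each outcome group.
complexity_by_outcome = toy.groupby("outcome")["Hostility"].describe()

print(accuracy_summary)
print(wtl_per_level)
print(complexity_by_outcome)
```

The outcome column can also feed the boxplots mentioned in the last step, for example with `sns.boxplot(data=toy, x="outcome", y="Hostility")`. For the real data, the same calls would be repeated for each measure in list_CM, once the results have been reshaped to one row per dataset.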