# Generador de los datos de entrenamiento

## Objetivos del análisis
* Extraer el data frame final con los datos preparados para entrenar algoritmos machine learning.

## Descripción de la muestra

El DataFrame en cuestión está formado por las características extraídas de un array de datos al comprimirlo y descomprimirlo mediante blosc. En cada fichero aparecen distintos conjuntos de datos los cuáles dividimos en fragmentos de 16 MegaBytes y sobre los cuales realizamos las pruebas de compresión y decompresión.  
Cada fila se corresponde con los datos de realizar los test de compresión sobre un fragmento (*chunk*) de datos específico con un tamaño de bloque, codec, filtro y nivel de compresión determinados.

Variable | Descripción
-------------  | -------------
*Filename* | nombre del fichero del que proviene.
*DataSet* | dentro del fichero el conjunto de datos del que proviene.
*Table* | 0 si los datos vienen de un array, 1 si vienen de tablas y 2 para tablas columnares.
*DType* | indica el tipo de los datos.
*Chunk_Number* | número de fragmento dentro del conjunto de datos.
*Chunk_Size* | tamaño del fragmento.
*Mean* | la media.
*Median* | la mediana.
*Sd* | la desviación típica.
*Skew* | el coeficiente de asimetría.
*Kurt* | el coeficiente de apuntamiento.
*Min* | el mínimo absoluto.
*Max* | el máximo absoluto.
*Q1* | el primer cuartil.
*Q3* | el tercer cuartil.
*N_Streaks* | número de rachas seguidas por encima o debajo de la mediana.
*Block_Size* | el tamaño de bloque que utilizará Blosc para comprimir.
*Codec* | el codec de blosc utilizado.
*Filter* | el filtro de blosc utilizado.
*CL* | el nivel de compresión utilizado.
*CRate* | el ratio de compresión obtenido.
*CSpeed* | la velocidad de compresión obtenida en GB/s.
*DSpeed* | la velocidad de decompresión obtenida en GB/s.

In [1]:
%load_ext autoreload
%autoreload 2

%load_ext version_information
%version_information numpy, scipy, matplotlib, pandas

Software,Version
Python,3.5.2 64bit [MSC v.1900 64 bit (AMD64)]
IPython,5.1.0
OS,Windows 10 10.0.14393 SP0
numpy,1.11.3
scipy,0.19.0
matplotlib,2.0.0
pandas,0.19.2
Sat Mar 25 19:49:15 2017 Hora estándar romance,Sat Mar 25 19:49:15 2017 Hora estándar romance


In [2]:
import numpy as np
import pandas as pd
from IPython.display import display

pd.options.display.float_format = '{:,.3f}'.format

In [3]:
CHUNK_ID = ["Filename", "DataSet", "Table", "Chunk_Number"]
#CHUNK_FEATURES = ["Table", "DType", "Chunk_Size", "Mean", "Median", "Sd", "Skew", "Kurt", "Min", "Max", "Q1", "Q3", "N_Streaks"]
CHUNK_FEATURES = ["Table", "DType", "Chunk_Size", "Mean", "Median", "Sd", "Skew", "Kurt", "Min", "Max", "Q1", "Q3"]
OUT_OPTIONS = ["Block_Size", "Codec", "Filter", "CL"]
TEST_FEATURES = ["CRate", "CSpeed", "DSpeed"]
COLS = ["Filename" , "DataSet", "Chunk_Number"] + CHUNK_FEATURES + OUT_OPTIONS + TEST_FEATURES
IN_TESTS = ['BLZ_CRate', 'BLZ_CSpeed', 'BLZ_DSpeed', 'LZ4_CRate', 'LZ4_CSpeed', 'LZ4_DSpeed']
IN_USER = ['IN_CR', 'IN_CS', 'IN_DS']

In [4]:
df = pd.read_csv('../data/blosc_test_data_v2.csv.gz', sep='\t')
my_df = df[(df.Filename != 'WRF_India-LSD1.h5') & (df.Filename != 'WRF_India-LSD2.h5') 
           & (df.Filename != 'WRF_India-LSD3.h5') & (df.CL != 0) & (df.CRate > 1.1)]

In [5]:
# DATAFRAME WITH DISTINCT CHUNKS
chunks_df = my_df.drop_duplicates(subset=CHUNK_ID)
print("%d rows" % chunks_df.shape[0])
chunk_tests_list = []
# FOR EACH CHUNK
for index, row in chunks_df.iterrows():
    # DATAFRAME WITH CHUNK TESTS
    chunk_tests_list.append(my_df[(my_df.Filename == row["Filename"]) & (my_df.DataSet == row["DataSet"]) &
                        (my_df.Table == row["Table"]) & (my_df.Chunk_Number == row["Chunk_Number"])])

673 rows


In [6]:
training_df = pd.DataFrame()
for chunk_test in chunk_tests_list:
    # EXTRACT MAX MIN AND SOME AUX MAX INDICES
    i_max_crate, i_max_c_speed, i_max_d_speed = chunk_test['CRate'].idxmax(), chunk_test['CSpeed'].idxmax(),\
                                                chunk_test['DSpeed'].idxmax()
    max_crate, max_c_speed, max_d_speed = (chunk_test.ix[i_max_crate]['CRate'], chunk_test.ix[i_max_c_speed]['CSpeed'],
                                           chunk_test.ix[i_max_d_speed]['DSpeed'])

    min_crate, min_c_speed, min_d_speed = (chunk_test['CRate'].min(), chunk_test['CSpeed'].min(),
                                           chunk_test['DSpeed'].min())
    # NORMALIZED COLUMNS
    chunk_test = chunk_test.assign(N_CRate=(chunk_test['CRate'] - min_crate) / (max_crate - min_crate),
                                   N_CSpeed=(chunk_test['CSpeed'] - min_c_speed) / (max_c_speed - min_c_speed),
                                   N_DSpeed=(chunk_test['DSpeed'] - min_d_speed) / (max_d_speed - min_d_speed))
    # DISTANCE FUNC COLUMNS
    chunk_test = chunk_test.assign(Distance_1=(chunk_test['N_CRate'] - 1)**2 + (chunk_test['N_CSpeed'] - 1)**2,
                                   Distance_2=(chunk_test['N_CRate'] - 1) ** 2 + (chunk_test['N_DSpeed'] - 1) ** 2,
                                   Distance_3=(chunk_test['N_CRate'] - 1) ** 2 + (chunk_test['N_DSpeed'] - 1) ** 2 +
                                              (chunk_test['N_CSpeed'] - 1) ** 2,
                                   Distance_4=(chunk_test['N_CSpeed'] - 1) ** 2 + (chunk_test['N_DSpeed'] - 1) ** 2
                                   )
    # BALANCED INDICES
    i_balanced_c_speed, i_balanced_d_speed, i_balanced, i_balanced_speeds = (chunk_test['Distance_1'].idxmin(),
                                                                             chunk_test['Distance_2'].idxmin(),
                                                                             chunk_test['Distance_3'].idxmin(),
                                                                             chunk_test['Distance_4'].idxmin())
    indices = [i_max_d_speed, i_max_c_speed, i_balanced_speeds, i_max_crate, i_balanced_d_speed, i_balanced_c_speed,
               i_balanced]
    # TYPE FILTER FOR LZ_DATA
    d_type = chunk_test.iloc[0]['DType']
    filter_name = 'noshuffle'
    if 'float' in d_type or 'int' in d_type:
        filter_name = 'shuffle'
    aux = df[(df.CL == 1) & (df.Block_Size == 0) & (df.Filter == filter_name) &
             (df.Filename == chunk_test.iloc[0]['Filename']) & (df.DataSet == chunk_test.iloc[0]['DataSet']) &
             (df.Table == chunk_test.iloc[0]['Table']) & (df.Chunk_Number == chunk_test.iloc[0]['Chunk_Number'])]
    lz_data = np.append(aux[aux.Codec == 'blosclz'][TEST_FEATURES].values[0],
                        aux[aux.Codec == 'lz4'][TEST_FEATURES].values[0])
    # APPEND ROWS TO TRAINING DATA FRAME
    for i in range(len(indices)):
        in_1, r = divmod(i, 4)
        in_2, in_3 = divmod(r, 2)
        training_df = training_df.append(dict(zip(COLS + IN_TESTS + IN_USER,
                                                  np.append(np.append(chunk_test.ix[indices[i]][COLS].values,
                                                                      lz_data),
                                                            [in_1, in_2, in_3]))),
                                         ignore_index=True)

# Some tests with the data generated

In [7]:
print('DISTINCT MAX RATE')
distinct_max_rate = training_df[(training_df.IN_CR == 1) & (training_df.IN_CS == 0) & (training_df.IN_DS == 0)]\
                    .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_max_rate.shape[0])
display(distinct_max_rate)
print('DISTINCT MAX C.SPEED')
distinct_max_c_speed = training_df[(training_df.IN_CR == 0) & (training_df.IN_CS == 1) & (training_df.IN_DS == 0)]\
                       .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_max_c_speed.shape[0])
display(distinct_max_c_speed)
print('DISTINCT MAX D.SPEED')
distinct_max_d_speed = training_df[(training_df.IN_CR == 0) & (training_df.IN_CS == 0) & (training_df.IN_DS == 1)]\
                      .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_max_d_speed.shape[0])
display(distinct_max_d_speed)
print('DISTINCT BALANCED CSPEED')
distinct_balanced_c_speed = training_df[(training_df.IN_CR == 1) & (training_df.IN_CS == 1) & (training_df.IN_DS == 0)]\
                            .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_balanced_c_speed.shape[0])
display(distinct_balanced_c_speed)
print('DISTINCT BALANCED DSPEED')
distinct_balanced_d_speed = training_df[(training_df.IN_CR == 1) & (training_df.IN_CS == 0) & (training_df.IN_DS == 1)]\
                            .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_balanced_d_speed.shape[0])
display(distinct_balanced_d_speed)
print('DISTINCT BALANCED SPEED')
distinct_balanced_speed = training_df[(training_df.IN_CR == 0) & (training_df.IN_CS == 1) & (training_df.IN_DS == 1)]\
                          .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_balanced_speed.shape[0])
display(distinct_balanced_speed)
print('DISTINCT BALANCED')
distinct_balanced = training_df[(training_df.IN_CR == 1) & (training_df.IN_CS == 1) & (training_df.IN_DS == 1)]\
                    .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_balanced.shape[0])
display(distinct_balanced)

DISTINCT MAX RATE
151 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
4,0.000,lz4hc,shuffle,7.000,1.309,0.154,8.543
11,256.000,lz4hc,shuffle,7.000,1.304,0.158,8.254
25,256.000,lz4hc,shuffle,9.000,1.296,0.108,8.355
32,256.000,lz4hc,shuffle,6.000,1.287,0.170,8.724
39,256.000,lz4hc,shuffle,8.000,1.291,0.119,8.267
53,128.000,lz4hc,shuffle,6.000,1.285,0.186,8.806
60,64.000,lz4hc,shuffle,8.000,1.287,0.178,8.797
81,128.000,lz4hc,shuffle,8.000,1.287,0.140,8.641
95,64.000,lz4hc,shuffle,9.000,1.284,0.150,9.005
102,512.000,lz4hc,shuffle,9.000,1.292,0.086,7.905


DISTINCT MAX C.SPEED
175 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
2,0.000,lz4,shuffle,2.000,1.232,6.367,9.473
9,64.000,lz4,shuffle,2.000,1.243,6.593,9.027
16,32.000,lz4,shuffle,2.000,1.237,6.527,9.034
23,16.000,lz4,shuffle,2.000,1.219,6.308,8.844
30,16.000,lz4,shuffle,4.000,1.219,5.806,8.672
44,128.000,lz4,shuffle,1.000,1.227,5.833,8.947
58,8.000,lz4,shuffle,3.000,1.216,5.964,9.166
79,128.000,lz4,shuffle,3.000,1.233,5.717,8.856
86,32.000,lz4,shuffle,5.000,1.220,5.617,8.849
93,64.000,lz4,shuffle,3.000,1.224,5.633,8.753


DISTINCT MAX D.SPEED
124 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
1,32.000,lz4,shuffle,3.000,1.239,6.426,8.218
8,64.000,lz4,shuffle,3.000,1.243,6.741,8.479
15,0.000,lz4,shuffle,1.000,1.231,6.680,7.969
22,16.000,lz4,shuffle,1.000,1.217,6.399,8.225
36,16.000,lz4,shuffle,2.000,1.218,6.212,8.456
50,64.000,lz4,shuffle,2.000,1.224,6.080,8.811
57,32.000,lz4,shuffle,1.000,1.226,6.138,8.379
99,0.000,lz4,shuffle,2.000,1.209,6.039,8.541
155,8.000,lz4,shuffle,1.000,1.244,6.565,8.764
330,64.000,lz4,shuffle,1.000,1.264,6.756,8.766


DISTINCT BALANCED CSPEED
161 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
6,256.000,lz4,shuffle,8.000,1.257,5.292,8.872
13,128.000,lz4,shuffle,4.000,1.249,6.187,9.180
20,128.000,lz4,shuffle,5.000,1.248,6.137,9.384
27,128.000,lz4,shuffle,6.000,1.239,5.505,9.297
34,16.000,lz4,shuffle,7.000,1.224,5.491,8.799
41,128.000,lz4,shuffle,2.000,1.231,5.891,9.232
48,256.000,lz4,shuffle,4.000,1.233,5.384,8.492
55,256.000,lz4,shuffle,3.000,1.232,5.652,8.289
62,64.000,lz4,shuffle,8.000,1.238,5.413,8.356
76,128.000,lz4,shuffle,8.000,1.239,5.245,8.592


DISTINCT BALANCED DSPEED
103 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
5,256.000,lz4,shuffle,6.000,1.254,5.649,8.630
19,256.000,lz4,shuffle,7.000,1.252,5.970,8.266
26,128.000,lz4,shuffle,7.000,1.241,5.405,8.265
33,128.000,lz4,shuffle,5.000,1.234,5.032,7.715
40,256.000,lz4,shuffle,5.000,1.237,5.601,7.957
61,64.000,lz4,shuffle,8.000,1.238,5.413,8.356
75,128.000,lz4,shuffle,8.000,1.239,5.245,8.592
89,16.000,lz4,bitshuffle,9.000,1.243,4.634,8.096
117,0.000,lz4,bitshuffle,2.000,1.251,5.478,8.234
138,16.000,lz4,bitshuffle,8.000,1.288,5.131,8.611


DISTINCT BALANCED SPEED
24 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
3,2048.0,zstd,shuffle,9.0,1.366,0.014,3.364
654,0.0,zstd,shuffle,9.0,49.115,0.065,4.823
661,0.0,zstd,shuffle,8.0,1.212,0.012,1.125
668,0.0,zstd,shuffle,7.0,1.179,0.013,1.033
696,2048.0,zstd,bitshuffle,9.0,4.588,0.057,2.801
738,2048.0,zstd,noshuffle,9.0,2.915,0.004,1.211
745,0.0,zstd,noshuffle,7.0,1.769,0.073,2.247
864,2048.0,zstd,bitshuffle,7.0,3334.325,0.778,2.677
913,2048.0,zstd,shuffle,8.0,113.87,0.462,5.461
1004,2048.0,zstd,noshuffle,4.0,10645.442,6.216,13.089


DISTINCT BALANCED
0 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed


In [8]:
distinct_total = training_df.drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%d distinct options from a total of %d' % (distinct_total.shape[0], 1620))
distinct_total_noblock = distinct_total.drop_duplicates(subset=OUT_OPTIONS[1:4])
print('%d distinct options from a total of %d' % (distinct_total_noblock.shape[0], 162))
print('Distinct codecs %d' % distinct_total.drop_duplicates(subset=['Codec']).shape[0])
print('Distinct filters %d' % distinct_total.drop_duplicates(subset=['Filter']).shape[0])
print('Distinct CL %d' % distinct_total.drop_duplicates(subset=['CL']).shape[0])
print('Distinct block sizes %d' % distinct_total.drop_duplicates(subset=['Block_Size']).shape[0])
display(distinct_total.describe())

488 distinct options from a total of 1620
97 distinct options from a total of 162
Distinct codecs 5
Distinct filters 3
Distinct CL 9
Distinct block sizes 10


Unnamed: 0,Block_Size,CL,CRate,CSpeed,DSpeed
count,488.0,488.0,488.0,488.0,488.0
mean,355.852,5.1,275.449,5.918,12.455
std,587.847,2.621,1422.534,6.053,8.774
min,0.0,1.0,1.101,0.004,0.47
25%,16.0,3.0,1.257,0.734,8.372
50%,64.0,5.0,3.916,4.551,9.473
75%,256.0,7.0,46.134,7.494,13.798
max,2048.0,9.0,10645.442,21.824,63.441


Zlib ha muerto.

In [9]:
# IMPRIMIMOS A NUESTRO MARCIANO FAVORITO
display(distinct_total[distinct_total.Codec == 'snappy'])

Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
4319,32.0,snappy,noshuffle,3.0,8.332,4.128,9.029


Snappy está moribundo. Por tanto podríamos considerar que tenemos 488/1080 opciones totales y sin contar el tamaño de bloque 97/108.

In [10]:
print('%d blosclz classes from 270' % distinct_total[distinct_total.Codec == 'blosclz'].shape[0])
print('%d lz4 classes from 270' % distinct_total[distinct_total.Codec == 'lz4'].shape[0])
print('%d lz4hc classes from 270' % distinct_total[distinct_total.Codec == 'lz4hc'].shape[0])
print('%d zstd classes from 270' % distinct_total[distinct_total.Codec == 'zstd'].shape[0])

131 blosclz classes from 270
181 lz4 classes from 270
106 lz4hc classes from 270
69 zstd classes from 270


In [11]:
training_df.to_csv('../data/training_data.csv', sep='\t', index=False)