# Training data generator

## Analysis objectives
* Extract the final data frame, with the data prepared for training machine learning algorithms.

## Sample description

The DataFrame holds the features extracted from an array of data when compressing and decompressing it with Blosc. Each file contains several datasets, which we split into 16 MB chunks; the compression and decompression tests are run on those chunks.  
Each row holds the results of running the compression tests on one specific chunk of data with a given block size, codec, filter and compression level.

Variable | Description
-------------  | -------------
*Filename* | name of the source file.
*DataSet* | dataset within the file that the chunk comes from.
*Table* | 0 if the data come from an array, 1 if they come from tables and 2 for columnar tables.
*DType* | data type.
*Chunk_Number* | chunk number within the dataset.
*Chunk_Size* | chunk size.
*Mean* | mean.
*Median* | median.
*Sd* | standard deviation.
*Skew* | skewness coefficient.
*Kurt* | kurtosis coefficient.
*Min* | absolute minimum.
*Max* | absolute maximum.
*Q1* | first quartile.
*Q3* | third quartile.
*N_Streaks* | number of consecutive runs above or below the median.
*Block_Size* | block size that Blosc uses to compress (0 selects the automatic block size).
*Codec* | Blosc codec used.
*Filter* | Blosc filter used.
*CL* | compression level used.
*CRate* | compression ratio achieved.
*CSpeed* | compression speed achieved, in GB/s.
*DSpeed* | decompression speed achieved, in GB/s.
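
As an illustration of how the per-chunk features in the table could be computed, here is a minimal NumPy sketch. The function name `chunk_features`, the use of population moments, the excess-kurtosis convention and the choice to ignore values equal to the median when counting streaks are all assumptions; the notebook does not state how the original features were obtained.

```python
import numpy as np


def chunk_features(chunk):
    """Summary statistics for one chunk (hypothetical helper, not from the notebook)."""
    x = np.asarray(chunk, dtype=np.float64).ravel()
    mean, median, sd = x.mean(), np.median(x), x.std()
    q1, q3 = np.percentile(x, [25, 75])
    if sd > 0:
        z = (x - mean) / sd
        # Third and fourth standardized moments (excess kurtosis: assumption).
        skew, kurt = np.mean(z ** 3), np.mean(z ** 4) - 3.0
    else:
        # Constant chunk: the moments are undefined, report 0 by convention.
        skew, kurt = 0.0, 0.0
    # Streaks: runs of consecutive values on the same side of the median.
    signs = np.sign(x - median)
    signs = signs[signs != 0]  # values equal to the median are ignored (assumption)
    n_streaks = int(np.count_nonzero(np.diff(signs))) + 1 if signs.size else 0
    return {"Mean": mean, "Median": median, "Sd": sd, "Skew": skew, "Kurt": kurt,
            "Min": x.min(), "Max": x.max(), "Q1": q1, "Q3": q3,
            "N_Streaks": n_streaks}


features = chunk_features([1, 5, 1, 5, 1, 5])
print(features["N_Streaks"])
```

A strictly alternating sequence such as the one above changes side at every step, so it has as many streaks as elements; a sorted sequence has only two.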

In [1]:
%load_ext autoreload
%autoreload 2

%load_ext version_information
%version_information numpy, scipy, matplotlib, pandas

Software,Version
Python,3.5.2 64bit [MSC v.1900 64 bit (AMD64)]
IPython,5.3.0
OS,Windows 10 10.0.14393 SP0
numpy,1.11.3
scipy,0.19.0
matplotlib,2.0.0
pandas,0.19.2
Mon Apr 03 10:47:24 2017 Hora de verano romance


In [2]:
import numpy as np
import pandas as pd
from IPython.display import display

pd.options.display.float_format = '{:,.3f}'.format

In [3]:
CHUNK_ID = ["Filename", "DataSet", "Table", "Chunk_Number"]
#CHUNK_FEATURES = ["Table", "DType", "Chunk_Size", "Mean", "Median", "Sd", "Skew", "Kurt", "Min", "Max", "Q1", "Q3", "N_Streaks"]
CHUNK_FEATURES = ["Table", "DType", "Chunk_Size", "Mean", "Median", "Sd", "Skew", "Kurt", "Min", "Max", "Q1", "Q3"]
OUT_OPTIONS = ["Block_Size", "Codec", "Filter", "CL"]
TEST_FEATURES = ["CRate", "CSpeed", "DSpeed"]
COLS = ["Filename" , "DataSet", "Chunk_Number"] + CHUNK_FEATURES + OUT_OPTIONS + TEST_FEATURES
IN_TESTS = ['BLZ_CRate', 'BLZ_CSpeed', 'BLZ_DSpeed', 'LZ4_CRate', 'LZ4_CSpeed', 'LZ4_DSpeed']
IN_USER = ['IN_CR', 'IN_CS', 'IN_DS']

In [4]:
df = pd.read_csv('../data/blosc_test_data.csv.gz', sep='\t')
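# DROP THE WRF_India-LSD FILES, TESTS WITHOUT COMPRESSION (CL 0) AND TESTS THAT BARELY COMPRESS (CRATE <= 1.1)
excluded_files = ['WRF_India-LSD1.h5', 'WRF_India-LSD2.h5', 'WRF_India-LSD3.h5']
my_df = df[~df.Filename.isin(excluded_files) & (df.CL != 0) & (df.CRate > 1.1)]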

In [5]:
# DATAFRAME WITH DISTINCT CHUNKS
chunks_df = my_df.drop_duplicates(subset=CHUNK_ID)
print("%d rows" % chunks_df.shape[0])
chunk_tests_list = []
# FOR EACH CHUNK
for index, row in chunks_df.iterrows():
    # DATAFRAME WITH CHUNK TESTS
    chunk_tests_list.append(my_df[(my_df.Filename == row["Filename"]) & (my_df.DataSet == row["DataSet"]) &
                        (my_df.Table == row["Table"]) & (my_df.Chunk_Number == row["Chunk_Number"])])

673 rows
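
The per-chunk split built in the loop above can also be expressed with `groupby`, which avoids rescanning the full frame once per chunk. The sketch below runs on a hypothetical miniature of `my_df` (the `toy` frame and its values are invented for illustration); it is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

CHUNK_ID = ["Filename", "DataSet", "Table", "Chunk_Number"]

# Hypothetical miniature of my_df: two chunks with two tests each.
toy = pd.DataFrame({
    "Filename": ["a.h5", "a.h5", "a.h5", "a.h5"],
    "DataSet": ["x", "x", "x", "x"],
    "Table": [0, 0, 0, 0],
    "Chunk_Number": [0, 0, 1, 1],
    "Codec": ["lz4", "zstd", "lz4", "zstd"],
    "CRate": [1.2, 1.5, 1.3, 1.9],
})

# One sub-frame per distinct chunk, in order of first appearance.
chunk_tests_list = [group for _, group in toy.groupby(CHUNK_ID, sort=False)]
print("%d chunks" % len(chunk_tests_list))
```

Each element keeps the original row index, so later calls to `idxmax` and `idxmin` still return labels that are valid in the full frame.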


In [6]:
training_df = pd.DataFrame()
for chunk_test in chunk_tests_list:
    # EXTRACT MAX MIN AND SOME AUX MAX INDICES
    i_max_crate, i_max_c_speed, i_max_d_speed = chunk_test['CRate'].idxmax(), chunk_test['CSpeed'].idxmax(),\
                                                chunk_test['DSpeed'].idxmax()
    max_crate, max_c_speed, max_d_speed = (chunk_test.loc[i_max_crate, 'CRate'], chunk_test.loc[i_max_c_speed, 'CSpeed'],
                                           chunk_test.loc[i_max_d_speed, 'DSpeed'])

    min_crate, min_c_speed, min_d_speed = (chunk_test['CRate'].min(), chunk_test['CSpeed'].min(),
                                           chunk_test['DSpeed'].min())
    # NORMALIZED COLUMNS
    chunk_test = chunk_test.assign(N_CRate=(chunk_test['CRate'] - min_crate) / (max_crate - min_crate),
                                   N_CSpeed=(chunk_test['CSpeed'] - min_c_speed) / (max_c_speed - min_c_speed),
                                   N_DSpeed=(chunk_test['DSpeed'] - min_d_speed) / (max_d_speed - min_d_speed))
    # DISTANCE FUNC COLUMNS
    chunk_test = chunk_test.assign(Distance_1=(chunk_test['N_CRate'] - 1)**2 + (chunk_test['N_CSpeed'] - 1)**2,
                                   Distance_2=(chunk_test['N_CRate'] - 1) ** 2 + (chunk_test['N_DSpeed'] - 1) ** 2,
                                   Distance_3=(chunk_test['N_CRate'] - 1) ** 2 + (chunk_test['N_DSpeed'] - 1) ** 2 +
                                              (chunk_test['N_CSpeed'] - 1) ** 2,
                                   Distance_4=(chunk_test['N_CSpeed'] - 1) ** 2 + (chunk_test['N_DSpeed'] - 1) ** 2
                                   )
    # BALANCED INDICES
    i_balanced_c_speed, i_balanced_d_speed, i_balanced, i_balanced_speeds = (chunk_test['Distance_1'].idxmin(),
                                                                             chunk_test['Distance_2'].idxmin(),
                                                                             chunk_test['Distance_3'].idxmin(),
                                                                             chunk_test['Distance_4'].idxmin())
    indices = [i_max_d_speed, i_max_c_speed, i_balanced_speeds, i_max_crate, i_balanced_d_speed, i_balanced_c_speed,
               i_balanced]
    # TYPE FILTER FOR LZ_DATA
    d_type = chunk_test.iloc[0]['DType']
    filter_name = 'noshuffle'
    if 'float' in d_type or 'int' in d_type:
        filter_name = 'shuffle'
    aux = df[(df.CL == 1) & (df.Block_Size == 0) & (df.Filter == filter_name) &
             (df.Filename == chunk_test.iloc[0]['Filename']) & (df.DataSet == chunk_test.iloc[0]['DataSet']) &
             (df.Table == chunk_test.iloc[0]['Table']) & (df.Chunk_Number == chunk_test.iloc[0]['Chunk_Number'])]
    lz_data = np.append(aux[aux.Codec == 'blosclz'][TEST_FEATURES].values[0],
                        aux[aux.Codec == 'lz4'][TEST_FEATURES].values[0])
    # APPEND ROWS TO TRAINING DATA FRAME
    # i + 1 WRITTEN IN BINARY GIVES THE USER FLAGS (IN_CR, IN_CS, IN_DS), E.G. i = 4 -> 5 -> (1, 0, 1)
    for i in range(len(indices)):
        in_1, r = divmod((i+1), 4)
        in_2, in_3 = divmod(r, 2)
        training_df = training_df.append(dict(zip(COLS + IN_TESTS + IN_USER,
                                                  np.append(np.append(chunk_test.loc[indices[i], COLS].values,
                                                                      lz_data),
                                                            [in_1, in_2, in_3]))),
                                         ignore_index=True)
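
The selection logic in the cell above can be followed on a small example: each metric is min-max normalized within the chunk, the "balanced" options are the rows closest to the ideal point where every normalized metric of interest equals 1, and the position in `indices` is turned into the three user flags. The `tests` frame, its values and the helper names `min_max` and `user_flags` are hypothetical, chosen only to illustrate the technique.

```python
import pandas as pd

# Hypothetical tests for a single chunk; the index plays the role of the row labels in my_df.
tests = pd.DataFrame({
    "Codec":  ["lz4", "lz4hc", "zstd", "blosclz"],
    "CRate":  [1.2, 1.6, 2.0, 1.1],
    "CSpeed": [6.0, 0.5, 0.1, 4.0],
    "DSpeed": [8.0, 8.5, 2.0, 10.0],
}, index=[10, 11, 12, 13])


def min_max(col):
    # Rescale a column to [0, 1]; assumes the column is not constant.
    return (col - col.min()) / (col.max() - col.min())


norm = tests[["CRate", "CSpeed", "DSpeed"]].apply(min_max)

# Squared Euclidean distance to the ideal point (1, ..., 1) over the chosen metrics.
dist_all = ((norm - 1) ** 2).sum(axis=1)                          # like Distance_3
dist_rate_dspeed = ((norm[["CRate", "DSpeed"]] - 1) ** 2).sum(axis=1)  # like Distance_2

i_balanced = dist_all.idxmin()
i_balanced_d_speed = dist_rate_dspeed.idxmin()


def user_flags(i):
    # Position i in `indices` -> (IN_CR, IN_CS, IN_DS): the binary digits of i + 1.
    in_cr, r = divmod(i + 1, 4)
    in_cs, in_ds = divmod(r, 2)
    return in_cr, in_cs, in_ds


flags = [user_flags(i) for i in range(7)]
print(tests.loc[i_balanced, "Codec"], tests.loc[i_balanced_d_speed, "Codec"])
print(flags)
```

Read this way, the order of `indices` is not arbitrary: position 0 maps to (0, 0, 1), the maximum decompression speed, position 3 maps to (1, 0, 0), the maximum compression ratio, and position 6 maps to (1, 1, 1), the fully balanced option. Note that the normalization divides by `max - min`, so a chunk whose tests all share the same value for a metric would need separate handling.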

## Some checks

In [7]:
print('DISTINCT MAX RATE')
distinct_max_rate = training_df[(training_df.IN_CR == 1) & (training_df.IN_CS == 0) & (training_df.IN_DS == 0)]\
                    .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_max_rate.shape[0])
display(distinct_max_rate)
print('DISTINCT MAX C.SPEED')
distinct_max_c_speed = training_df[(training_df.IN_CR == 0) & (training_df.IN_CS == 1) & (training_df.IN_DS == 0)]\
                       .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_max_c_speed.shape[0])
display(distinct_max_c_speed)
print('DISTINCT MAX D.SPEED')
distinct_max_d_speed = training_df[(training_df.IN_CR == 0) & (training_df.IN_CS == 0) & (training_df.IN_DS == 1)]\
                      .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_max_d_speed.shape[0])
display(distinct_max_d_speed)
print('DISTINCT BALANCED CSPEED')
distinct_balanced_c_speed = training_df[(training_df.IN_CR == 1) & (training_df.IN_CS == 1) & (training_df.IN_DS == 0)]\
                            .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_balanced_c_speed.shape[0])
display(distinct_balanced_c_speed)
print('DISTINCT BALANCED DSPEED')
distinct_balanced_d_speed = training_df[(training_df.IN_CR == 1) & (training_df.IN_CS == 0) & (training_df.IN_DS == 1)]\
                            .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_balanced_d_speed.shape[0])
display(distinct_balanced_d_speed)
print('DISTINCT BALANCED SPEED')
distinct_balanced_speed = training_df[(training_df.IN_CR == 0) & (training_df.IN_CS == 1) & (training_df.IN_DS == 1)]\
                          .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_balanced_speed.shape[0])
display(distinct_balanced_speed)
print('DISTINCT BALANCED')
distinct_balanced = training_df[(training_df.IN_CR == 1) & (training_df.IN_CS == 1) & (training_df.IN_DS == 1)]\
                    .drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%s rows' % distinct_balanced.shape[0])
display(distinct_balanced)

DISTINCT MAX RATE
24 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
3,2048.0,zstd,shuffle,9.0,1.366,0.014,3.32
654,0.0,zstd,shuffle,9.0,49.115,0.066,4.596
661,0.0,zstd,shuffle,8.0,1.212,0.011,1.145
668,0.0,zstd,shuffle,7.0,1.179,0.013,1.05
696,2048.0,zstd,bitshuffle,9.0,4.588,0.057,2.761
738,2048.0,zstd,noshuffle,9.0,2.915,0.004,1.191
745,0.0,zstd,noshuffle,7.0,1.769,0.073,2.253
864,2048.0,zstd,bitshuffle,7.0,3334.325,0.898,2.621
913,2048.0,zstd,shuffle,8.0,113.87,0.47,5.591
1004,2048.0,zstd,noshuffle,4.0,10645.442,5.965,13.296


DISTINCT MAX C.SPEED
121 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
1,64.000,lz4,shuffle,3.000,1.244,6.440,8.259
8,16.000,lz4,shuffle,1.000,1.233,6.758,8.311
22,0.000,lz4,shuffle,1.000,1.217,6.533,8.251
29,64.000,lz4,shuffle,1.000,1.225,6.336,8.421
50,64.000,lz4,shuffle,4.000,1.226,6.166,8.820
71,32.000,lz4,shuffle,1.000,1.221,6.324,8.349
78,32.000,lz4,shuffle,2.000,1.223,6.195,8.309
106,64.000,lz4,shuffle,2.000,1.228,6.190,8.516
120,0.000,lz4,shuffle,2.000,1.240,6.464,8.889
365,16.000,lz4,shuffle,6.000,1.253,6.434,11.005


DISTINCT MAX D.SPEED
168 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
0,32.000,blosclz,shuffle,1.000,1.108,4.402,9.849
7,32.000,blosclz,bitshuffle,2.000,1.110,3.669,9.728
14,16.000,blosclz,shuffle,4.000,1.170,2.678,9.861
21,8.000,blosclz,shuffle,2.000,1.119,4.288,9.845
28,64.000,blosclz,shuffle,3.000,1.129,3.476,9.781
35,128.000,lz4,shuffle,5.000,1.235,5.643,9.389
49,8.000,blosclz,shuffle,3.000,1.122,3.133,9.358
56,128.000,blosclz,shuffle,2.000,1.113,4.087,9.786
63,0.000,blosclz,shuffle,4.000,1.178,2.672,9.715
70,0.000,blosclz,shuffle,3.000,1.115,3.441,9.658


DISTINCT BALANCED CSPEED
86 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
5,256.000,lz4,shuffle,4.000,1.252,5.825,8.554
12,256.000,lz4,shuffle,8.000,1.255,6.033,8.148
19,128.000,lz4,shuffle,5.000,1.248,6.201,9.488
26,256.000,lz4,shuffle,6.000,1.242,5.558,8.574
33,0.000,lz4,shuffle,8.000,1.241,5.440,8.557
47,128.000,lz4,shuffle,9.000,1.240,5.069,8.449
68,128.000,lz4,shuffle,8.000,1.246,5.466,8.650
82,512.000,lz4,shuffle,8.000,1.243,5.329,7.827
103,256.000,lz4,shuffle,7.000,1.233,5.358,8.169
124,0.000,lz4,bitshuffle,2.000,1.264,5.749,8.606


DISTINCT BALANCED DSPEED
150 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
4,256.000,lz4hc,shuffle,8.000,1.310,0.134,8.219
11,128.000,lz4hc,shuffle,8.000,1.300,0.153,8.513
25,256.000,lz4hc,shuffle,5.000,1.292,0.190,8.783
32,128.000,lz4hc,shuffle,9.000,1.286,0.117,8.351
39,256.000,lz4hc,shuffle,7.000,1.290,0.147,8.150
67,0.000,lz4hc,shuffle,7.000,1.295,0.156,8.556
74,256.000,lz4hc,shuffle,6.000,1.287,0.177,8.857
88,512.000,lz4hc,shuffle,9.000,1.292,0.089,8.237
102,256.000,lz4hc,shuffle,9.000,1.289,0.098,8.428
144,128.000,lz4hc,shuffle,7.000,1.337,0.184,8.907


DISTINCT BALANCED SPEED
142 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
2,32.000,lz4,shuffle,3.000,1.239,6.157,8.908
9,16.000,lz4,shuffle,5.000,1.238,6.366,9.139
16,0.000,lz4,shuffle,4.000,1.239,6.501,9.179
23,128.000,lz4,shuffle,2.000,1.233,6.000,9.044
30,64.000,lz4,shuffle,2.000,1.226,6.196,8.831
44,16.000,lz4,shuffle,2.000,1.215,6.046,8.550
58,0.000,lz4,shuffle,3.000,1.223,6.079,9.284
65,0.000,lz4,shuffle,2.000,1.224,6.287,9.158
79,32.000,lz4,shuffle,7.000,1.229,5.864,8.974
86,32.000,lz4,shuffle,5.000,1.220,5.856,8.940


DISTINCT BALANCED
137 rows


Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
6,256.000,lz4,shuffle,5.000,1.252,5.722,8.805
13,128.000,lz4,shuffle,6.000,1.250,6.042,9.115
20,128.000,lz4,shuffle,5.000,1.248,6.201,9.488
27,256.000,lz4,shuffle,6.000,1.242,5.558,8.574
34,0.000,lz4,shuffle,8.000,1.241,5.440,8.557
41,128.000,lz4,shuffle,8.000,1.239,5.462,8.981
48,128.000,lz4,shuffle,9.000,1.240,5.069,8.449
55,256.000,lz4,shuffle,8.000,1.240,5.200,8.912
62,256.000,lz4,shuffle,7.000,1.244,5.451,8.562
69,0.000,lz4,shuffle,7.000,1.244,5.598,8.965


In [8]:
distinct_total = training_df.drop_duplicates(subset=OUT_OPTIONS)[OUT_OPTIONS + TEST_FEATURES]
print('%d distinct options from a total of %d' % (distinct_total.shape[0], 1620))
distinct_total_noblock = distinct_total.drop_duplicates(subset=OUT_OPTIONS[1:4])
print('%d distinct options from a total of %d' % (distinct_total_noblock.shape[0], 162))
print('Distinct codecs %d' % distinct_total.drop_duplicates(subset=['Codec']).shape[0])
print('Distinct filters %d' % distinct_total.drop_duplicates(subset=['Filter']).shape[0])
print('Distinct CL %d' % distinct_total.drop_duplicates(subset=['CL']).shape[0])
print('Distinct block sizes %d' % distinct_total.drop_duplicates(subset=['Block_Size']).shape[0])
display(distinct_total.describe())

448 distinct options from a total of 1620
94 distinct options from a total of 162
Distinct codecs 5
Distinct filters 3
Distinct CL 9
Distinct block sizes 10


Unnamed: 0,Block_Size,CL,CRate,CSpeed,DSpeed
count,448.0,448.0,448.0,448.0,448.0
mean,327.839,5.319,353.398,6.766,13.798
std,565.141,2.608,1650.551,6.858,11.137
min,0.0,1.0,1.101,0.004,0.474
25%,16.0,3.0,1.264,0.732,8.78
50%,64.0,6.0,5.852,4.916,10.383
75%,256.0,8.0,50.728,9.687,14.505
max,2048.0,9.0,10645.442,23.848,86.345


Zlib is dead: only 5 of the 6 codecs tested appear among the selected options, and zlib is the missing one.

In [9]:
# PRINT OUR FAVORITE MARTIAN
display(distinct_total[distinct_total.Codec == 'snappy'])

Unnamed: 0,Block_Size,Codec,Filter,CL,CRate,CSpeed,DSpeed
1681,0.0,snappy,noshuffle,7.0,21.195,20.349,9.824
4291,128.0,snappy,noshuffle,3.0,11.049,7.158,13.803


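Snappy is dying: it is selected with only two configurations. Leaving zlib and snappy out, we could therefore consider that we have 446/1080 options in total (the 448 found minus the 2 snappy rows) and, ignoring block size, 92/108 (the 94 found minus the 2 snappy combinations).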
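
The option counts can be cross-checked from the outputs printed in this notebook. The sketch below only redoes the arithmetic; the factorization into 6 codecs, 3 filters, 9 compression levels and 10 block sizes is inferred from the totals 1620, 162 and 270 used in the code cells, and the per-codec counts are copied from the next cell's output.

```python
# Sizes of the option space, inferred from the totals used in the notebook.
n_codecs, n_filters, n_levels, n_block_sizes = 6, 3, 9, 10

total_options = n_codecs * n_filters * n_levels * n_block_sizes   # 1620
total_noblock = n_codecs * n_filters * n_levels                   # 162
per_codec = n_filters * n_levels * n_block_sizes                  # 270

# Distinct selected options per codec, as printed by the notebook.
selected = {"blosclz": 128, "lz4": 158, "lz4hc": 93, "zstd": 67, "snappy": 2}

selected_without_snappy = sum(v for k, v in selected.items() if k != "snappy")
remaining_options = 4 * per_codec                   # zlib and snappy left out
remaining_noblock = 4 * n_filters * n_levels

# 94 distinct (codec, filter, CL) combinations were found, 2 of them snappy.
noblock_without_snappy = 94 - 2

print("%d/%d" % (selected_without_snappy, remaining_options))
print("%d/%d" % (noblock_without_snappy, remaining_noblock))
```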

In [10]:
print('%d blosclz classes from 270' % distinct_total[distinct_total.Codec == 'blosclz'].shape[0])
print('%d lz4 classes from 270' % distinct_total[distinct_total.Codec == 'lz4'].shape[0])
print('%d lz4hc classes from 270' % distinct_total[distinct_total.Codec == 'lz4hc'].shape[0])
print('%d zstd classes from 270' % distinct_total[distinct_total.Codec == 'zstd'].shape[0])

128 blosclz classes from 270
158 lz4 classes from 270
93 lz4hc classes from 270
67 zstd classes from 270


In [11]:
training_df.to_csv('../data/training_data.csv', sep='\t', index=False)

## Checking the automatic block size

In [15]:
print("%d from %d" % (training_df[training_df.Block_Size == 0].shape[0], training_df.shape[0]))

759 from 4711


In [31]:
count = 0
report = []
for indices, row in training_df.iterrows():
    block = row['Block_Size']
    if block != 0:
        aux = df[(df.Filename == row['Filename']) & (df.DataSet == row['DataSet']) &
                 (df.Table == row['Table']) & (df.Chunk_Number == row['Chunk_Number']) &
                 (df.Codec == row['Codec']) & (df.Filter == row['Filter']) & (df.CL == row["CL"])]
        crate = aux[aux.Block_Size == 0]['CRate'].values[0]
        result = aux[(aux.CRate == crate) & (aux.Block_Size != 0)]
        auto_block = result['Block_Size'].values[0]
        if auto_block == block:
            count += 1
        else:
            report.append((block, auto_block))
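
The idea behind the loop above is that the test run with `Block_Size == 0` does not record which block size Blosc picked, so it is inferred by looking for an explicit block size that yields exactly the same compression ratio. A minimal sketch on invented data follows; the `tests` frame and the helper `infer_auto_block` are hypothetical, and unlike the original cell the helper returns `None` instead of raising when no explicit block size matches.

```python
import pandas as pd

# Hypothetical tests of one chunk with a fixed codec, filter and compression level.
tests = pd.DataFrame({
    "Block_Size": [0, 16, 32, 64],
    "CRate": [1.25, 1.20, 1.25, 1.31],
})


def infer_auto_block(tests):
    """Return the explicit block size whose CRate equals the automatic one, or None."""
    auto_crate = tests.loc[tests.Block_Size == 0, "CRate"].iloc[0]
    same = tests[(tests.Block_Size != 0) & (tests.CRate == auto_crate)]
    if same.empty:
        return None
    # If several block sizes tie, the first one is taken, as in the original cell.
    return int(same["Block_Size"].iloc[0])


print(infer_auto_block(tests))
```

The match relies on exact floating-point equality of the compression ratios, which is reasonable only if the ratio is computed the same way in every test of the chunk.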

In [34]:
print("%d from %d" % (training_df[training_df.Block_Size == 0].shape[0] + count, training_df.shape[0]))

1241 from 4711


In [38]:
for line in report:
    print("Winner --> %d | Predicted --> %d" % line)

Winner --> 32 | Predicted --> 16
Winner --> 64 | Predicted --> 16
Winner --> 32 | Predicted --> 16
Winner --> 2048 | Predicted --> 1024
Winner --> 256 | Predicted --> 512
Winner --> 256 | Predicted --> 32
Winner --> 256 | Predicted --> 32
Winner --> 32 | Predicted --> 16
Winner --> 16 | Predicted --> 32
Winner --> 2048 | Predicted --> 1024
Winner --> 128 | Predicted --> 512
Winner --> 128 | Predicted --> 64
Winner --> 16 | Predicted --> 32
Winner --> 2048 | Predicted --> 1024
Winner --> 256 | Predicted --> 512
Winner --> 128 | Predicted --> 32
Winner --> 128 | Predicted --> 32
Winner --> 8 | Predicted --> 16
Winner --> 128 | Predicted --> 16
Winner --> 2048 | Predicted --> 1024
Winner --> 256 | Predicted --> 64
Winner --> 256 | Predicted --> 64
Winner --> 256 | Predicted --> 64
Winner --> 64 | Predicted --> 16
Winner --> 64 | Predicted --> 16
Winner --> 64 | Predicted --> 16
Winner --> 2048 | Predicted --> 1024
Winner --> 128 | Predicted --> 1024
Winner --> 128 | Predicted --> 32
Winne