# Форматы данных (2)

Материалы:
* Макрушин С.В. "Лекция 5: Форматы данных (часть 2)"
* https://docs.python.org/3/library/csv.html
* https://docs.h5py.org/en/stable/
* Уэс Маккини. Python и анализ данных

## Задачи для совместного разбора

In [206]:
import csv
import json
import pickle
import pandas as pd
import numpy as np
import h5py

from bs4 import BeautifulSoup as bs
from pprint import pprint as pp

1. Считайте данные из файла `open_pubs.csv`, используя `csv.reader`, и преобразуйте к структуре данных следующего вида:
    
`{'fas_id': [24, 30, ...], 'name': ['Achor Inn', 'Angel Inn', ...], ... }`

In [207]:
import csv
import json
with open("open_pubs.csv") as fp:
    reader = csv.reader(fp)
    for row in reader:
        print(row)
        break

['fas_id', 'name', 'address', 'postcode', 'easting', 'northing', 'latitude', 'longitude', 'local_authority']


2. Сгенерируйте 2 случайные матрицы размера 10_000 x 10_000 и вычислите их произведение. Сколько времени занимают три этих операции? Сохраните 3 полученных матрицы в файл .npz с соответствующими названиями

In [208]:
import numpy as np

A = np.random.randint(0, 100, size=(10_000, 10_000))
B = np.random.randint(0, 100, size=(10_000, 10_000))

np.save('A.npy', A)
np.savez('AB.npz', artem=A, nikita=B)

r = np.load('AB.npz')
r.files

['artem', 'nikita']

3. Создайте 2 матрицы размера 1000x1000, используя различные параметризируемые распределения из numpy (https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.random.html#distributions)

После этого сохраните получившиеся матрицы в hdf5-файл в виде двух различных датасетов. В качестве описания каждого датасета укажите параметры используемых распределений 

In [209]:
import h5py

with h5py.File('test.h5', 'w') as hdf:
    ds1 = hdf.create_dataset('arrA', data=A)
    ds2 = hdf.create_dataset('arrB', data=B)

    ds1.attrs['Description'] = 'Здесь лежит массив A'
    ds2.attrs['Description'] = 'Здесь лежит массив B'

with h5py.File('test.h5', 'r') as hdf:
    ds1 = hdf['arrA']
    print(type(ds1))
    arr = ds1[1500:5000]
ds1

<class 'h5py._hl.dataset.Dataset'>


<Closed HDF5 dataset>

## Лабораторная работа 5

### csv

1.1 В файле `tags_sample.csv` находится информация о тэгах, приписываемых рецептам. Воспользовавшись `csv.reader`, считайте этот файл и создайте словарь вида `id_рецепта: [список тэгов]`. Сохраните этот словарь в файл `tags_sample.json`.

In [210]:
id_tags_dict = {}
with open("tags_sample.csv") as fp:
    reader = csv.reader(fp)
    headers = next(reader)

    for row in reader:
        if row[0] not in id_tags_dict:
            id_tags_dict[row[0]] = list()
        id_tags_dict[row[0]].append(row[1])

with open("tags_sample.json", 'w') as j:
    json.dump(id_tags_dict, j)


1.2 Считайте файл `recipes_sample_with_filled_nsteps.csv` (__ЛР4__) в виде `pd.DataFrame`. Добавьте к таблице 2 столбца: `n_tags`, содержащий количество тэгов у этого рецепта; и `tags`, содержащий набор тэгов в виде строки (тэги внутри строки разделяются символом `;`)

In [211]:
df = pd.read_csv("recipes_sample_with_filled_nsteps.csv", parse_dates=["submitted"], index_col=0)
df.head()

Unnamed: 0_level_0,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
11,11,11,11,11,11.0,11,11.0
3,3,3,3,3,3.0,3,3.0
5,5,5,5,5,5.0,5,5.0
7,7,7,7,7,7.0,7,7.0
love is in the air beef fondue sauces,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,


In [212]:
tags_series = pd.Series(id_tags_dict)
tags_df = pd.DataFrame(tags_series).reset_index()#, columns=["id", "tags"])
tags_df.columns = ['id', 'tags']
tags_df.head()

Unnamed: 0,id,tags
0,44123,"[weeknight, time-to-make, course, main-ingredi..."
1,67664,"[15-minutes-or-less, time-to-make, course, pre..."
2,38798,"[30-minutes-or-less, time-to-make, course, mai..."
3,35173,"[60-minutes-or-less, time-to-make, course, pre..."
4,84797,"[30-minutes-or-less, time-to-make, course, mai..."


In [213]:
tags_df["n_tags"] = tags_df["tags"].str.len()
tags_df.head()

Unnamed: 0,id,tags,n_tags
0,44123,"[weeknight, time-to-make, course, main-ingredi...",25
1,67664,"[15-minutes-or-less, time-to-make, course, pre...",31
2,38798,"[30-minutes-or-less, time-to-make, course, mai...",17
3,35173,"[60-minutes-or-less, time-to-make, course, pre...",11
4,84797,"[30-minutes-or-less, time-to-make, course, mai...",19


In [214]:
tags_df["tags"] = tags_df["tags"].apply(';'.join)
tags_df["id"] = pd.to_numeric(tags_df["id"])
tags_df.head()

Unnamed: 0,id,tags,n_tags
0,44123,weeknight;time-to-make;course;main-ingredient;...,25
1,67664,15-minutes-or-less;time-to-make;course;prepara...,31
2,38798,30-minutes-or-less;time-to-make;course;main-in...,17
3,35173,60-minutes-or-less;time-to-make;course;prepara...,11
4,84797,30-minutes-or-less;time-to-make;course;main-in...,19


In [215]:
recipes = df.merge(tags_df, left_on="id", right_on="id")
recipes.head()

Unnamed: 0,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,tags,n_tags
0,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,,30-minutes-or-less;time-to-make;course;main-in...,19
1,107229,28,173674,2004-12-30,8.0,this is a very versatile and widely enjoyed pa...,12.0,30-minutes-or-less;time-to-make;course;main-in...,13
2,95926,5,118163,2004-07-20,4.0,you just have to try it to believe it.,,15-minutes-or-less;time-to-make;course;main-in...,30
3,453467,45,1848091,2011-04-11,12.0,this is the recipe that we use at my school ca...,11.0,60-minutes-or-less;time-to-make;cuisine;prepar...,9
4,306168,40,50969,2008-05-30,6.0,since there are already 411 recipes for brocco...,,60-minutes-or-less;time-to-make;course;main-in...,10


1.3 В файле `ingredients_sample.csv` находится информация о ингредиентах, необходимых для рецепта. Воспользовавшись `csv.DictReader`, считайте этот файл и создайте словарь вида `id_рецепта: [список ингредиентов]`.

In [216]:
dictionary = {}
with open("ingredients_sample.csv", "r") as csvf:
    reader = csv.DictReader(csvf)

    for row in reader:
        if int(row["recipe_id"]) not in dictionary:
            dictionary[int(row["recipe_id"])] = []
        dictionary[int(row["recipe_id"])].append(row["ingredient"])

dictionary

{44123: ['unsalted butter',
  'carrot',
  'onion',
  'celery',
  'broccoli stem',
  'dried thyme',
  'dried oregano',
  'dried sweet basil leaves',
  'dry white wine',
  'chicken stock',
  'worcestershire sauce',
  'tabasco sauce',
  'smoked chicken',
  'black beans',
  'broccoli floret',
  'heavy cream',
  'salt & fresh ground pepper',
  'cornstarch'],
 250900: ['unsalted butter',
  'all-purpose flour',
  'walnuts',
  'light brown sugar',
  'refrigerated pie crust',
  'granny smith apples'],
 120462: ['unsalted butter',
  'onion',
  'milk',
  'salt',
  'egg',
  'cream cheese',
  'extra-sharp cheddar cheese',
  'fresh ground black pepper',
  'garlic clove',
  'penne pasta',
  'gruyere cheese',
  'hot red pepper flakes',
  'sweet hungarian paprika',
  'saltines'],
 257111: ['unsalted butter',
  'milk',
  'eggs',
  'honey',
  'white bread',
  'vanilla',
  'ground cinnamon',
  'hot water'],
 148114: ['unsalted butter',
  'nuts',
  'granulated sugar',
  'semi-sweet chocolate chips'],
 1564

1.4 Добавьте к таблице из задания 1.2 столбец `ingredients`, содержащий набор ингредиентов в виде строки (ингредиенты внутри строки разделяются символом `*`)

Для строк, которые содержат пропуски в столбце `n_ingredients`, заполните их на основе файла  `ingredients_sample.csv`

In [217]:
# with open("./data/data/ingredients_sample.csv", 'r') as f:
ingredients_sample = pd.read_csv("ingredients_sample.csv")
ingredients_sample.head()

Unnamed: 0,ingredient,recipe_id
0,unsalted butter,44123
1,unsalted butter,250900
2,unsalted butter,120462
3,unsalted butter,257111
4,unsalted butter,148114


In [218]:
sr1 = pd.DataFrame(pd.Series(dictionary).reset_index())
sr1.columns= ["recipe_id", "ingredients"]
sr1["ingredients"] = sr1["ingredients"].apply('*'.join)
sr1["recipe_id"] = pd.to_numeric(sr1["recipe_id"])
sr1.head()

Unnamed: 0,recipe_id,ingredients
0,44123,unsalted butter*carrot*onion*celery*broccoli s...
1,250900,unsalted butter*all-purpose flour*walnuts*ligh...
2,120462,unsalted butter*onion*milk*salt*egg*cream chee...
3,257111,unsalted butter*milk*eggs*honey*white bread*va...
4,148114,unsalted butter*nuts*granulated sugar*semi-swe...


In [219]:
recipes = recipes.merge(sr1, left_on="id", right_on="recipe_id")
recipes.head()

Unnamed: 0,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,tags,n_tags,recipe_id,ingredients
0,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,,30-minutes-or-less;time-to-make;course;main-in...,19,84797,beef steaks*vegetable oil*spicy mustard*fresh ...
1,107229,28,173674,2004-12-30,8.0,this is a very versatile and widely enjoyed pa...,12.0,30-minutes-or-less;time-to-make;course;main-in...,13,107229,vegetable oil*vermicelli*rice vinegar*reduced ...
2,95926,5,118163,2004-07-20,4.0,you just have to try it to believe it.,,15-minutes-or-less;time-to-make;course;main-in...,30,95926,white bread*mayonnaise*bananas
3,453467,45,1848091,2011-04-11,12.0,this is the recipe that we use at my school ca...,11.0,60-minutes-or-less;time-to-make;cuisine;prepar...,9,453467,eggs*margarine*brown sugar*salt*white sugar*va...
4,306168,40,50969,2008-05-30,6.0,since there are already 411 recipes for brocco...,,60-minutes-or-less;time-to-make;course;main-in...,10,306168,milk*garlic powder*salt*frozen broccoli cuts*c...


In [220]:
np.sum(np.isnan(recipes["n_ingredients"]))

5585

In [221]:
recipes["n_ingredients"] = recipes["n_ingredients"].fillna(recipes["id"].apply(lambda x: len(dictionary.get(x))))
recipes

Unnamed: 0,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,tags,n_tags,recipe_id,ingredients
0,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,12.0,30-minutes-or-less;time-to-make;course;main-in...,19,84797,beef steaks*vegetable oil*spicy mustard*fresh ...
1,107229,28,173674,2004-12-30,8.0,this is a very versatile and widely enjoyed pa...,12.0,30-minutes-or-less;time-to-make;course;main-in...,13,107229,vegetable oil*vermicelli*rice vinegar*reduced ...
2,95926,5,118163,2004-07-20,4.0,you just have to try it to believe it.,3.0,15-minutes-or-less;time-to-make;course;main-in...,30,95926,white bread*mayonnaise*bananas
3,453467,45,1848091,2011-04-11,12.0,this is the recipe that we use at my school ca...,11.0,60-minutes-or-less;time-to-make;cuisine;prepar...,9,453467,eggs*margarine*brown sugar*salt*white sugar*va...
4,306168,40,50969,2008-05-30,6.0,since there are already 411 recipes for brocco...,9.0,60-minutes-or-less;time-to-make;course;main-in...,10,306168,milk*garlic powder*salt*frozen broccoli cuts*c...
...,...,...,...,...,...,...,...,...,...,...,...
18809,74023,50,89831,2003-10-24,14.0,this has been a long time family favorite!,8.0,60-minutes-or-less;time-to-make;course;prepara...,11,74023,eggs*butter*cheddar cheese*sour cream*flour*br...
18810,415406,45,485109,2010-03-04,5.0,this is a favourite winter warmer. by british ...,9.0,weeknight;60-minutes-or-less;time-to-make;cour...,15,415406,potatoes*onions*garlic cloves*cream cheese*chi...
18811,464576,70,226863,2011-09-20,14.0,this soup is a hearty meal! from luisa musso.,17.0,time-to-make;course;main-ingredient;cuisine;pr...,28,464576,onion*carrots*garlic cloves*olive oil*parmesan...
18812,267661,80,200862,2007-11-25,16.0,this is based on a french recipe but i changed...,10.0,time-to-make;course;main-ingredient;cuisine;pr...,18,267661,dry white wine*eggs*cheddar cheese*baking powd...


1.5 Проверьте, содержит ли столбец `n_ingredients` пропуски. Если нет, треобразуйте его к целочисленному типу и сохраните результаты в файл `recipes_sample_with_tags_ingredients.csv`

In [222]:
np.sum(np.isnan(recipes["n_ingredients"]))

0

In [223]:
recipes["n_ingredients"] = recipes["n_ingredients"].astype("int64")
recipes.head()

Unnamed: 0,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,tags,n_tags,recipe_id,ingredients
0,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,12,30-minutes-or-less;time-to-make;course;main-in...,19,84797,beef steaks*vegetable oil*spicy mustard*fresh ...
1,107229,28,173674,2004-12-30,8.0,this is a very versatile and widely enjoyed pa...,12,30-minutes-or-less;time-to-make;course;main-in...,13,107229,vegetable oil*vermicelli*rice vinegar*reduced ...
2,95926,5,118163,2004-07-20,4.0,you just have to try it to believe it.,3,15-minutes-or-less;time-to-make;course;main-in...,30,95926,white bread*mayonnaise*bananas
3,453467,45,1848091,2011-04-11,12.0,this is the recipe that we use at my school ca...,11,60-minutes-or-less;time-to-make;cuisine;prepar...,9,453467,eggs*margarine*brown sugar*salt*white sugar*va...
4,306168,40,50969,2008-05-30,6.0,since there are already 411 recipes for brocco...,9,60-minutes-or-less;time-to-make;course;main-in...,10,306168,milk*garlic powder*salt*frozen broccoli cuts*c...


In [224]:
recipes.to_csv("recipes_sample_with_tags_ingredients.csv", index=False)

### npy

2.1 Разделите таблицу, полученную в результате 1.5, на две таблицы: одна содержит рецепты, загруженные до 2000 года; вторая - все остальные. В полученных таблицах оставьте только числовые столбцы и преобразуйте их к `numpy.array`

In [225]:
recipes_npy = recipes.copy()
recipes_npy["submitted"] = pd.to_datetime(recipes_npy["submitted"],format='%Y-%m-%d', errors='coerce')

In [226]:
recipes_before_2000 = recipes_npy[recipes_npy["submitted"] < pd.to_datetime("1/1/2000")]
recipes_before_2000 = recipes_before_2000.drop(columns=["description", "tags", "ingredients"]).to_numpy()
recipes_before_2000

array([[3441, 30, 1562, ..., 8, 10, 3441],
       [4205, 25, 1617, ..., 5, 14, 4205],
       [5197, 0, 1534, ..., 5, 16, 5197],
       ...,
       [3189, 95, 59780, ..., 12, 13, 3189],
       [4801, 20, 1598, ..., 7, 18, 4801],
       [2982, 0, 124030, ..., 7, 13, 2982]], dtype=object)

In [227]:
recipes_after_2000 = recipes_npy[recipes_npy["submitted"] >= pd.to_datetime("1/1/2000")]
recipes_after_2000 = recipes_after_2000.drop(columns=["description", "tags", "ingredients"]).to_numpy()
recipes_after_2000

array([[84797, 25, 4470, ..., 12, 19, 84797],
       [107229, 28, 173674, ..., 12, 13, 107229],
       [95926, 5, 118163, ..., 3, 30, 95926],
       ...,
       [464576, 70, 226863, ..., 17, 28, 464576],
       [267661, 80, 200862, ..., 10, 18, 267661],
       [298512, 29, 506822, ..., 10, 12, 298512]], dtype=object)

2.2. Сохраните 2 полученных массива в архив `npz`. Дайте массивам читаемые имена.

In [228]:
np.savez("recipes.npz", recipes_after_2000=recipes_after_2000, recipes_before_2000=recipes_before_2000)

2.3 Считайте созданный архив и продемонстрируйте, что данные считались корректно. 

In [229]:
npz_file = np.load("recipes.npz", allow_pickle=True)

In [230]:
npz_file["recipes_after_2000"]

array([[84797, 25, 4470, ..., 12, 19, 84797],
       [107229, 28, 173674, ..., 12, 13, 107229],
       [95926, 5, 118163, ..., 3, 30, 95926],
       ...,
       [464576, 70, 226863, ..., 17, 28, 464576],
       [267661, 80, 200862, ..., 10, 18, 267661],
       [298512, 29, 506822, ..., 10, 12, 298512]], dtype=object)

In [231]:
npz_file["recipes_before_2000"]

array([[3441, 30, 1562, ..., 8, 10, 3441],
       [4205, 25, 1617, ..., 5, 14, 4205],
       [5197, 0, 1534, ..., 5, 16, 5197],
       ...,
       [3189, 95, 59780, ..., 12, 13, 3189],
       [4801, 20, 1598, ..., 7, 18, 4801],
       [2982, 0, 124030, ..., 7, 13, 2982]], dtype=object)

### hdf

3.1 Выведите названия всех датасетов, находящихся в файле `nutrition_sample.h5`, а также размерность матриц, содержащихся в данных датасетах и их метаданные.

Формат вывода:
```
Dataset name=dataset_0, dataset size=(30000,), metadata={'info': 'calories (#)'}
Dataset name=dataset_1, dataset size=(30000,), metadata={'info': 'total fat (PDV)'}
...
```

In [232]:
with h5py.File('nutrition_sample.h5', 'r') as hdf:
    for i in hdf.items():
        print(f"Dataset name={i[0]}, dataset size={i[1].shape}, metadata={dict(i[1].attrs)}")

Dataset name=dataset_0, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'calories (#)'}
Dataset name=dataset_1, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'total fat (PDV)'}
Dataset name=dataset_2, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'sugar (PDV)'}
Dataset name=dataset_3, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'sodium (PDV)'}
Dataset name=dataset_4, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'protein (PDV)'}
Dataset name=dataset_5, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'saturated fat (PDV)'}
Dataset name=dataset_6, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'carbohydrates (PDV)'}


3.2 Разбейте каждый из имеющихся датасетов на две части: 1 часть содержит только те строки, где PDV (Percent Daily Value) превышает 100%; 2 часть содержит те строки, где PDV не составляет не более 100%. Создайте 2 группы в файле и разместите в них соответствующие части датасета c сохранением метаданных исходных датасетов. Итого должно получиться 2 группы, содержащие несколько датасетов. Сохраните результаты в файл `nutrition_grouped.h5`

In [233]:
with h5py.File("nutrition_sample.h5", 'r') as h_file:
    with h5py.File("nutrition_grouped.h5", 'w') as h_writefile:
        group_less_than_100 = h_writefile.create_group('PDV_less_than_100')
        group_more_than_100 = h_writefile.create_group('PDV_more_than_100')

        for i in h_file:
            dataset = h_file.get(i)
            dataset_meta = dict(dataset.attrs)
            dataset_copy = dataset[:, :]

            dataset_pdv_less = dataset_copy[:, :]
            dataset_pdv_less = dataset_pdv_less[dataset_pdv_less[:, 1] < 1]

            dataset_pdv_more = dataset_copy[:, :]
            dataset_pdv_more = dataset_pdv_more[dataset_pdv_more[:, 1] > 1]

            for g in h_writefile:
                h_writefile.get(g).create_dataset(
                    name=i, shape=dataset_pdv_less.shape,
                    dtype=dataset_pdv_less.dtype,
                    data=dataset_pdv_less
                )

                for k in dataset_meta:
                    h_writefile.get(g).get(name=i).attrs[k] = dataset_meta[k]

        for g in h_writefile:
            print(h_writefile.get(g))

<HDF5 group "/PDV_less_than_100" (7 members)>
<HDF5 group "/PDV_more_than_100" (7 members)>


3.3 Выведите названия всех групп и датасетов, находящихся в этих группах, из файла `nutrition_grouped.h5` а также размерность матриц, содержащихся в датасетах и их метаданные.

In [234]:
with h5py.File("nutrition_grouped.h5", 'r') as grouped_hf:
    for group in grouped_hf:
        print("GROUP: ", group, end="\n")
        for dataset in grouped_hf.get(group):
            print("|->", dataset)
            print("|--> shape:", grouped_hf.get(group).get(dataset).shape)
            print("|--> metadata:", dict(grouped_hf.get(group).get(dataset).attrs))
            print("--------------------------------------------------------")
        print("\n\n")

GROUP:  PDV_less_than_100
|-> dataset_0
|--> shape: (12, 2)
|--> metadata: {'col_0': 'recipe_id', 'col_1': 'calories (#)'}
--------------------------------------------------------
|-> dataset_1
|--> shape: (2146, 2)
|--> metadata: {'col_0': 'recipe_id', 'col_1': 'total fat (PDV)'}
--------------------------------------------------------
|-> dataset_2
|--> shape: (1371, 2)
|--> metadata: {'col_0': 'recipe_id', 'col_1': 'sugar (PDV)'}
--------------------------------------------------------
|-> dataset_3
|--> shape: (2588, 2)
|--> metadata: {'col_0': 'recipe_id', 'col_1': 'sodium (PDV)'}
--------------------------------------------------------
|-> dataset_4
|--> shape: (1274, 2)
|--> metadata: {'col_0': 'recipe_id', 'col_1': 'protein (PDV)'}
--------------------------------------------------------
|-> dataset_5
|--> shape: (2569, 2)
|--> metadata: {'col_0': 'recipe_id', 'col_1': 'saturated fat (PDV)'}
--------------------------------------------------------
|-> dataset_6
|--> shape: (165

3.4 Модифицируйте код из 3.3 таким образом, чтобы сохранить датасеты, используя сжатие. Сравните размер полученного файла с размерами файла из 3.3. Прокомментируйте результат.

In [235]:
with h5py.File("nutrition_grouped.h5", 'r') as grouped_hf, h5py.File("nutrition_grouped_zip.h5", 'w') as zip_hf:
        for group in grouped_hf:
            for dataset in grouped_hf.get(group):
                array = grouped_hf[group][dataset]
                zip_hf.create_dataset(name=array.name, data=array, compression="lzf")

        for g in zip_hf:
            print(zip_hf.get(g))
            for dataset in grouped_hf.get(group):
                print(dataset)

<HDF5 group "/PDV_less_than_100" (7 members)>
dataset_0
dataset_1
dataset_2
dataset_3
dataset_4
dataset_5
dataset_6
<HDF5 group "/PDV_more_than_100" (7 members)>
dataset_0
dataset_1
dataset_2
dataset_3
dataset_4
dataset_5
dataset_6


In [236]:
import os

print(f'hdf5      : {os.stat("nutrition_grouped.h5").st_size} bytes')
print(f'hdf5 lzf : {os.stat("nutrition_grouped_zip.h5").st_size} bytes')

hdf5      : 384992 bytes
hdf5 lzf : 168430 bytes
