## h5py
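Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed for storing and organizing large amounts of data.

An HDF5 file opened with h5py is a container for two kinds of objects: datasets and groups.
- A dataset is an array-like collection of data, much like a NumPy array.
- A group is a folder-like container that behaves like a Python dictionary, with keys and values. A group can hold datasets or other groups. The key is the member's name, and the value is the member object itself (a group or a dataset).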
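
The folder and dictionary analogy can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: the file name `container_demo.h5` and the member names `measurements`, `temperature`, `raw`, and `counts` are made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file inside a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'container_demo.h5')

with h5py.File(path, 'w') as f:
    grp = f.create_group('measurements')          # a group acts like a folder
    grp['temperature'] = np.array([20.5, 21.0])   # key -> dataset
    sub = grp.create_group('raw')                 # groups can be nested
    sub['counts'] = np.arange(4)

    member_names = sorted(grp.keys())             # keys are the member names
    is_dataset = isinstance(grp['temperature'], h5py.Dataset)
    is_group = isinstance(grp['raw'], h5py.Group)
    nested_path = f['measurements/raw/counts'].name  # path-style lookup

print(member_names, is_dataset, is_group, nested_path)
```

Indexing with a slash-separated path reaches nested members directly, which is where the folder analogy comes from.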

In [1]:
import numpy as np
np.set_printoptions(threshold=16, suppress=True, precision=5)

In [2]:
# Create an HDF5 file
import h5py

f = h5py.File('first_h5py.h5', 'w')  # use 'r' to open an existing file read-only

### Creating datasets

In [None]:
# f.create_dataset?
# f.create_dataset(name, shape=None, dtype=None, data=None, **kwds)

In [3]:
# Create a dataset
# Arguments: dataset name, shape, dtype
d1 = f.create_dataset('dataset1', (20, ), 'i')

In [4]:
f.keys()

<KeysViewHDF5 ['dataset1']>

In [5]:
for key in f.keys():
    print(key)
    print(f[key].name)
    print(f[key].shape)
    # f[key].value was removed in h5py 3.0; use f[key][()] to read the whole dataset
    print(f[key][:3])

dataset1
/dataset1
(20,)
[0 0 0]


In [6]:
# Assign values to the existing dataset
d1[...] = np.arange(20)

In [7]:
# Create a dataset and assign its values in one step
f['dataset2'] = np.arange(15)

In [8]:
f['dataset2'][...]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [9]:
# Pass the values via data= when creating the dataset
d2 = f.create_dataset('dataset3', data=np.arange(12).reshape((3, 4)))

In [10]:
for key in f.keys():
    print(key)
    print(f[key].name)
    print(f[key].shape)
    print(f[key][...])
    print()

dataset1
/dataset1
(20,)
[ 0  1  2 ... 17 18 19]

dataset2
/dataset2
(15,)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]

dataset3
/dataset3
(3, 4)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
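
The cells above create a dataset from a shape and dtype alone, then fill it later. The sketch below shows the same pattern on a throwaway file. The file name `dataset_demo.h5` and dataset name `counts` are hypothetical. It assumes the default fill value, which is why unwritten elements read back as 0, as in the `[0 0 0]` output earlier.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file inside a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'dataset_demo.h5')

with h5py.File(path, 'w') as f:
    d = f.create_dataset('counts', shape=(6,), dtype='i')  # 'i' is a C int
    initial = d[...]          # nothing written yet: the default fill value is read
    d[2:4] = [7, 8]           # write only a slice; the rest stays untouched
    after = d[...]            # read everything back as a NumPy array
    kind, itemsize = d.dtype.kind, d.dtype.itemsize

print(initial, after, kind, itemsize)
```

Slice assignment writes straight to the file, so only the selected elements are transferred.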



### Creating groups

In [None]:
# f.create_group?
# f.create_group(name, track_order=None)

In [11]:
g1 = f.create_group('bar')

In [12]:
g1['data1'] = np.arange(12)
g1['data2'] = np.arange(12).reshape((4, 3))

In [13]:
for key in g1.keys():
    print(g1[key].name)
    print(g1[key][:])
    print()

/bar/data1
[ 0  1  2  3  4  5  6  7  8  9 10 11]

/bar/data2
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
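
Iterating over `keys()` only lists the direct members of one group. To walk a whole file including nested groups, one option is `visititems`, which calls a function for every member it finds. A minimal sketch, assuming a hypothetical file `tree_demo.h5` laid out like the one built above:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file inside a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'tree_demo.h5')

with h5py.File(path, 'w') as f:
    f['top'] = np.arange(3)
    bar = f.create_group('bar')
    bar['data1'] = np.arange(12)
    bar['data2'] = np.arange(12).reshape((4, 3))

    members = []

    def record(name, obj):
        # name is the path relative to the object visititems was called on
        kind = 'group' if isinstance(obj, h5py.Group) else 'dataset'
        members.append((name, kind))

    f.visititems(record)   # visits every group and dataset below f
    members.sort()         # sort explicitly so the listing order is predictable

for name, kind in members:
    print(kind, name)
```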



### Reading data from an HDF5 file

In [14]:
f.close()  # close the handle opened in 'w' mode so all data is flushed before reopening
f = h5py.File('first_h5py.h5', 'r')

In [15]:
f.keys()

<KeysViewHDF5 ['bar', 'dataset1', 'dataset2', 'dataset3']>

In [16]:
dataset1 = f['dataset1']
dataset1

<HDF5 dataset "dataset1": shape (20,), type "<i4">

In [17]:
np_data1 = np.array(dataset1[:])
np_data1

array([ 0,  1,  2, ..., 17, 18, 19], dtype=int32)
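
A dataset object is a handle to data on disk, and slicing it is what actually reads values into a NumPy array. The sketch below, which uses a hypothetical file `read_demo.h5`, opens the file in a `with` block so that it is closed automatically, and reads both a partial slice and the full dataset.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file inside a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'read_demo.h5')

with h5py.File(path, 'w') as f:
    f['dataset1'] = np.arange(20)

with h5py.File(path, 'r') as f:
    dset = f['dataset1']      # handle only; no data has been read yet
    first_five = dset[:5]     # reads just these five elements
    everything = dset[()]     # reads the whole dataset

# The arrays are ordinary in-memory copies and remain usable after the file is closed.
print(first_five, everything.shape)
```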

## MAT file data


The MAT format is MATLAB's standard file format for storing data.

In [18]:
import scipy.io as sio

In [19]:
data_A = np.arange(20).reshape(4, 5)

In [None]:
sio.savemat?

In [20]:
file_name = 'first_mat.mat'
sio.savemat(file_name, {'A': data_A})  

In [21]:
data = sio.loadmat(file_name)
type(data)

dict

In [22]:
data

{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Aug  3 23:02:33 2020',
 '__version__': '1.0',
 '__globals__': [],
 'A': array([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]])}

In [23]:
data_new = data['A']
data_new

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
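
One detail worth knowing when round-tripping through MAT files is that MATLAB variables are at least two-dimensional. With the default settings, a 1-D array or a scalar therefore comes back from `loadmat` as a 2-D array. A minimal sketch, assuming a hypothetical file `shape_demo.mat` and made-up variable names `v` and `s`:

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Hypothetical file inside a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'shape_demo.mat')
sio.savemat(path, {'v': np.arange(5), 's': 3.5})

raw = sio.loadmat(path)
raw_v_shape = raw['v'].shape        # 1-D input is stored as a row vector
raw_s_shape = raw['s'].shape        # a scalar is stored as a 1x1 matrix

squeezed = sio.loadmat(path, squeeze_me=True)   # drop the length-1 dimensions
squeezed_v_shape = squeezed['v'].shape
scalar = float(squeezed['s'])

# Skip the metadata entries such as '__header__' to get only the saved variables.
user_keys = sorted(k for k in raw if not k.startswith('__'))

print(raw_v_shape, raw_s_shape, squeezed_v_shape, scalar, user_keys)
```

Passing `squeeze_me=True` is a convenience when the data was 1-D or scalar to begin with. It also squeezes genuine 1xN matrices, so it is a choice to make per file.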

## Saving and loading data with NumPy

In [None]:
arr = np.arange(10)
# Binary .npy format (np.save appends the .npy extension if it is missing)
np.save('array', arr)

In [None]:
array_new = np.load('array.npy')
array_new
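
The cells above save to `'array'` but load from `'array.npy'`, because `np.save` adds the extension. The sketch below makes that visible and also shows `np.savez`, which is not used in the original notebook, as one way to keep several arrays in a single file. The file names and the keys `x` and `y` are illustrative.

```python
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()   # temporary directory so the sketch leaves no stray files
arr = np.arange(10)

np.save(os.path.join(tmp, 'array'), arr)        # written to disk as array.npy
saved_files = sorted(os.listdir(tmp))
restored = np.load(os.path.join(tmp, 'array.npy'))

# Several arrays in one .npz archive, addressed by keyword name.
np.savez(os.path.join(tmp, 'bundle.npz'), x=arr, y=arr.reshape(2, 5))
with np.load(os.path.join(tmp, 'bundle.npz')) as bundle:
    bundle_keys = sorted(bundle.files)
    y = bundle['y']

print(saved_files, restored, bundle_keys, y.shape)
```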

In [None]:
# Plain-text format
a = np.arange(9).reshape(3, 3)
np.savetxt('array.txt', a)

In [None]:
np.loadtxt?

In [None]:
np.loadtxt('array.txt')

In [None]:
# Read from a CSV file
np.loadtxt('tips.csv', delimiter=',', usecols=(0, 1), skiprows=1)  # skiprows=1 skips the header line

In [None]:
p, d = np.loadtxt('tips.csv', delimiter=',', usecols=(0, 1), skiprows=1, unpack=True)

In [None]:
p.shape, d.shape
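
Since `tips.csv` is not included here, the same `loadtxt` options can be tried on an in-memory stand-in. The CSV text below is made up for illustration and only mimics a header row followed by numeric columns.

```python
import io

import numpy as np

# Made-up CSV content standing in for tips.csv: one header line, three data rows.
csv_text = (
    "total_bill,tip,party\n"
    "16.99,1.01,2\n"
    "10.34,1.66,3\n"
    "21.01,3.50,3\n"
)

# usecols picks columns 0 and 1; skiprows=1 drops the header line.
table = np.loadtxt(io.StringIO(csv_text), delimiter=',', usecols=(0, 1), skiprows=1)

# unpack=True transposes the result so that each column can be bound to its own name.
bill, tip = np.loadtxt(io.StringIO(csv_text), delimiter=',',
                       usecols=(0, 1), skiprows=1, unpack=True)

print(table.shape, bill.shape, tip.shape)
```

Without `unpack=True` the result is one 2-D array with a row per line. With it, each selected column becomes a separate 1-D array.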

## Saving and loading data with pandas

In [None]:
import pandas as pd

In [None]:
pd_data = pd.read_csv('tips.csv')

In [None]:
pd_data

In [None]:
# Save in several formats
pd_data.to_pickle('data.pickle')
pd_data.to_json('data.json')
pd_data.to_excel('data.xlsx')  # needs an Excel writer engine such as openpyxl installed

In [None]:
pd_data_new = pd.read_json('data.json')
pd_data_new
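
A round trip can be checked without `tips.csv` by building a small DataFrame by hand. The column names and values below are invented for the sketch. Excel output is left out because it depends on an optional writer engine.

```python
import os
import tempfile

import pandas as pd

tmp = tempfile.mkdtemp()   # temporary directory so the sketch leaves no stray files

# Made-up stand-in for the tips data.
df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.5],
    'day': ['Sun', 'Sun', 'Sat'],
})

pickle_path = os.path.join(tmp, 'data.pickle')
json_path = os.path.join(tmp, 'data.json')

df.to_pickle(pickle_path)   # binary, Python-specific
df.to_json(json_path)       # text, readable from other languages

from_pickle = pd.read_pickle(pickle_path)
from_json = pd.read_json(json_path)

print(from_pickle.equals(df))
print(from_json)
```

Pickle preserves the DataFrame as it was. JSON is more portable, but the loader re-infers index and column types on read, so it is worth comparing the result against the original when exact dtypes matter.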