## Interpreting single-character features

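This tutorial covers HWDB1.0\~1.1 and OLHWDB1.0\~1.1. It assumes the datasets have already been downloaded to a local directory; adjust the settings below to match your own directory layout: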

In [None]:
import sys
sys.path

In [None]:
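# Remove the empty-string entry from sys.path; looking it up by value is
# safer than a hard-coded index such as `del sys.path[5]`
if '' in sys.path:
    sys.path.remove('')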

In [None]:
sys.path

In [None]:
# Root directory of the CASIA datasets
root = './datas_zips/'

Load the packages used in this tutorial:

In [None]:
# ERROR:: Could not find a local HDF5 installation.
#          You may need to explicitly state where your local HDF5 headers and
#          library can be found by setting the ``HDF5_DIR`` environment
#          variable or by using the ``--hdf5`` command-line option.

# This error appeared on my Mac while installing tables.

# A fix from Stack Overflow:

# https://stackoverflow.com/questions/73029883/could-not-find-hdf5-installation-for-pytables-on-m1-mac

!pip install cython
!brew install hdf5
!brew install c-blosc
# `!export` runs in a throwaway subshell, so set the variables with %env instead
%env HDF5_DIR=/opt/homebrew/opt/hdf5
%env BLOSC_DIR=/opt/homebrew/opt/c-blosc

In [None]:
# tables requires a local HDF5 installation first; on a Mac, run `brew install hdf5`
!pip install tables

![1iXIrJ](https://oss.images.shujudaka.com/uPic/1iXIrJ.png)

In [None]:
import struct
from pathlib import Path
from zipfile import ZipFile
import numpy as np

In [None]:
import tables as tb

`Path` offers a friendlier way to manage file paths:

In [None]:
root = Path(root)
# List every file under root's subdirectories
zips = []
for pname in root.iterdir():
    if not pname.is_dir():  # skip stray files such as .DS_Store
        continue
    for name in pname.iterdir():
        zips.append(name.name)

zips
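
The directory walk above can be tried without the real datasets. The sketch below builds a throwaway directory tree (the folder and file names are placeholders, not the actual CASIA layout) and collects only the `.zip` names one level below the root, ignoring stray files. The helper `list_zip_names` is a hypothetical name introduced for illustration.

```python
import tempfile
from pathlib import Path


def list_zip_names(root):
    """Return the sorted names of zip files one level below `root`."""
    names = []
    for sub in Path(root).iterdir():
        if not sub.is_dir():  # ignore stray files at the top level
            continue
        names.extend(p.name for p in sub.iterdir() if p.suffix == '.zip')
    return sorted(names)


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder layout, for illustration only
    (Path(tmp) / 'Feature_Data').mkdir()
    (Path(tmp) / 'Feature_Data' / 'demo_trn.zip').touch()
    (Path(tmp) / 'Feature_Data' / 'notes.txt').touch()
    (Path(tmp) / '.DS_Store').touch()
    demo_names = list_zip_names(tmp)

print(demo_names)
```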

The handcrafted features of each single character are stored in `.mpf` files. The files listed above are all zip archives, so we use `zipfile` to read them:

In [None]:
z = ZipFile(list(root.glob('**/HWDB1.0trn.zip'))[0])
z.namelist()[1:5]  # MPF files of the first 4 writers (entry 0 is skipped)

In [None]:
len(z.namelist())

In [None]:
z.namelist()
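
For readers without the archives at hand, here is a minimal sketch of the same `zipfile` calls on an in-memory archive. The member names and byte contents are made up for illustration; only `ZipFile`, `namelist`, and `read` come from the standard library.

```python
import io
from zipfile import ZipFile

# Build a tiny archive in memory (placeholder members, not real MPF data)
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('demo/001.mpf', b'\x01\x02\x03')
    zf.writestr('demo/002.mpf', b'\x04\x05')

# Re-open it for reading: list the members, then read one as raw bytes
with ZipFile(buffer) as zf:
    member_names = zf.namelist()
    first_bytes = zf.read(member_names[0])

print(member_names, first_bytes)
```

Reading a member returns raw `bytes`, which is what a binary decoder needs as input.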

Load the MPF decoder, `MPFDecoder`:

In [None]:
from casia.feature import MPFDecoder, zipfile2bunch
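
To give a feel for what a decoder of this kind has to do, the sketch below parses a synthetic buffer with `struct`. It is not the implementation of `MPFDecoder`. It assumes the header layout commonly described for CASIA MPF files (4-byte header size, 8-byte format code, free text, 20-byte code type, 2-byte code length, 20-byte data type, 4-byte sample count, 4-byte dimensionality, all little-endian), followed by one record per sample. Verify this layout against the dataset documentation before relying on it. The function `parse_mpf` and the sample data are hypothetical.

```python
import struct


def parse_mpf(raw, encoding='gbk'):
    """Parse MPF bytes under the ASSUMED layout noted in the comments."""
    header_size = struct.unpack_from('<i', raw, 0)[0]      # 4-byte int
    format_code = raw[4:12].rstrip(b'\x00').decode('ascii')  # 8 bytes
    text_end = header_size - 50  # the last 50 header bytes are fixed-size fields
    text = raw[12:text_end].rstrip(b'\x00').decode('ascii', errors='replace')
    code_type = raw[text_end:text_end + 20].rstrip(b'\x00').decode('ascii')
    code_len = struct.unpack_from('<h', raw, text_end + 20)[0]  # 2-byte short
    data_type = raw[text_end + 22:text_end + 42].rstrip(b'\x00').decode('ascii')
    n_samples, dim = struct.unpack_from('<ii', raw, text_end + 42)
    if data_type != 'unsigned char':
        raise ValueError(f'this sketch only handles unsigned char, got {data_type!r}')
    info = {'format': format_code, 'text': text, 'code_type': code_type,
            'n_samples': n_samples, 'dim': dim}
    labels, features = [], []
    record_size = code_len + dim  # label bytes followed by one byte per feature
    for i in range(n_samples):
        start = header_size + i * record_size
        labels.append(raw[start:start + code_len].decode(encoding))
        features.append(list(raw[start + code_len:start + record_size]))
    return info, labels, features


# Synthetic file: 2 samples, 4 features each (made-up values)
text = b'demo features'
header = (struct.pack('<i', 62 + len(text)) + b'MPF'.ljust(8, b'\x00') + text
          + b'GB'.ljust(20, b'\x00') + struct.pack('<h', 2)
          + b'unsigned char'.ljust(20, b'\x00') + struct.pack('<ii', 2, 4))
body = ('啊'.encode('gbk') + bytes([1, 2, 3, 4])
        + '阿'.encode('gbk') + bytes([5, 6, 7, 8]))
info, labels, features = parse_mpf(header + body)
print(info, labels, features)
```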

### Converting MPF to a bunch

In [None]:
zip_name = list(root.glob('**/HWDB1.0trn.zip'))[0]
zip_name

In [None]:
mb = zipfile2bunch(zip_name)

In [None]:
# Print the dataset
for key,value in mb.items():
    print(key,value)

In [None]:
# Inspect the feature table of one MPF file
mb['HWDB1.0trn/001.mpf']['dataset']

In [None]:
# Check its type
type(mb['HWDB1.0trn/001.mpf']['dataset'])

In [None]:
df = mb['HWDB1.0trn/001.mpf']['dataset']
df.iloc[0, :].values

### Converting the bunch to JSON

This step requires installing:

```
pip3 install torch torchvision torchaudio
```

In [None]:
from loader.utils.dataset import bunch2json, json2bunch

In [None]:
data_dir = Path('data')
data_dir.mkdir(exist_ok=True)  # create the directory if it does not exist yet

In [None]:
%%time
json_path = 'data/features.json'
bunch2json(mb, json_path)

In [None]:
%%time
# Load the data back
mpf_bunch = json2bunch(json_path)
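
The round trip can be illustrated with the standard `json` module. This is only a sketch of the idea, not the code behind `bunch2json` and `json2bunch`: the real bunch holds a feature table per MPF file, which is replaced here by plain lists so that it is JSON-serializable. The function names and sample values are hypothetical.

```python
import io
import json


def bunch_to_json(bunch, fp):
    """Write a dict of plain Python objects to a file-like object."""
    json.dump(bunch, fp, ensure_ascii=False)  # keep Chinese labels readable


def json_to_bunch(fp):
    """Read the dict back."""
    return json.load(fp)


demo_bunch = {
    'demo/001.mpf': {
        'text': 'demo features',
        'labels': ['啊', '阿'],
        'features': [[1, 2], [3, 4]],
    }
}

buf = io.StringIO()  # stands in for a file on disk
bunch_to_json(demo_bunch, buf)
buf.seek(0)
restored = json_to_bunch(buf)
print(restored == demo_bunch)
```

JSON stores every number as text, which is one reason a large feature matrix tends to take more space in this format than in a binary container.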

### Converting the bunch to HDF5

In [None]:
def bunch2hdf(bunch, save_path):
    '''Convert a bunch to HDF5.'''
    filters = tb.Filters(complevel=7, shuffle=False)  # compression settings
    # The context manager closes the file even if an error occurs
    with tb.open_file(save_path, 'w', filters=filters,
                      title="Xinet's dataset") as h:
        for name in bunch:  # create one group per MPF file
            # Replace '/' and '.' so the name works with attribute access,
            # e.g. h5.root.<name>
            _name = name.replace('/', '__').replace('.', '_')
            h.create_group('/', name=_name, filters=filters)
            h.create_array(f"/{_name}", 'text',
                           bunch[name]['text'].encode())
            features = bunch[name]['dataset']
            h.create_array(f"/{_name}", 'labels',
                           " ".join(features.index).encode())
            h.create_array(f"/{_name}", 'features', features.values)
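
Two small conventions in `bunch2hdf` are worth isolating: the mapping from a zip member name to an HDF5 group name, and the storage of all labels as one space-separated byte string. The sketch below reproduces both with plain Python; the helper names are hypothetical and the labels are sample values.

```python
def to_node_name(key):
    """Map a zip member name to a group name, as bunch2hdf does."""
    return key.replace('/', '__').replace('.', '_')


def pack_labels(labels):
    """Join labels into a single UTF-8 byte string."""
    return ' '.join(labels).encode()


def unpack_labels(raw):
    """Invert pack_labels."""
    return raw.decode().split(' ')


node_name = to_node_name('HWDB1.0trn/001.mpf')
round_trip = unpack_labels(pack_labels(['啊', '阿', '埃']))
print(node_name, round_trip)
```

The space-separated encoding is only reversible as long as no label itself contains a space.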

In [None]:
%%time
hdf_path = 'data/features.h5'
bunch2hdf(mpf_bunch, hdf_path)

In [None]:
%%time
h5 = tb.open_file(hdf_path)

In [None]:
# Shape of the feature matrix of one MPF file
h5.root.HWDB1_0trn__001_mpf.features[:].shape

In [None]:
# Feature description of one MPF file
h5.root.HWDB1_0trn__001_mpf.text.read()

In [None]:
# Labels of one MPF file
labels = h5.root.HWDB1_0trn__001_mpf.labels.read().decode()
labels = np.array(labels.split(' '))
labels

### Comparing JSON and HDF5 file sizes

In [None]:
import os
from sys import getsizeof


# Note: getsizeof is shallow; it measures the container, not its contents
print(
    f"JSON: Python object size {getsizeof(mpf_bunch)/1e3} kB, file size {os.path.getsize(json_path)/1e9} GB")
print(
    f"HDF5: Python object size {getsizeof(h5)} B, file size {os.path.getsize(hdf_path)/1e9} GB")

In [None]:
h5.close()  # close the file
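
When reading the in-memory sizes reported above, keep in mind that `sys.getsizeof` counts only the outer object, not what it refers to. The sketch below shows the difference with a hypothetical recursive helper, `deep_getsizeof`, which handles built-in containers only. It does not account for NumPy or pandas buffers; for those, `ndarray.nbytes` or `DataFrame.memory_usage` is the better tool.

```python
import sys


def deep_getsizeof(obj, seen=None):
    """Recursively sum getsizeof over built-in containers."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


demo = {'features': [list(range(100)) for _ in range(10)]}
shallow = sys.getsizeof(demo)    # the dict object alone
deep = deep_getsizeof(demo)      # the dict plus everything it holds
print(shallow, deep)
```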

## Packing multiple zip files

In [None]:
root2 = Path("./datas_zips/Character_Sample_Data/")
zip_gnt_names = set(root2.glob('*Gnt*.zip'))  # GNT archive paths
zip_pot_names = set(root2.glob('*Pot*.zip'))  # POT archive paths

# Collect every file under root's subdirectories
alls = []
for pname in root.iterdir():
    if not pname.is_dir():  # skip stray files such as .DS_Store
        continue
    for name in pname.iterdir():
        alls.append(name)
alls = set(alls)
alls

In [None]:
root

In [None]:
zip_gnt_names

In [None]:
zip_pot_names

In [None]:
# MPF archive paths: everything that is neither POT nor GNT
zip_mpf_names = alls - zip_pot_names - zip_gnt_names
zip_mpf_names
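
The set difference used above can be tried on its own with pure paths, which need no files on disk. The file names below are placeholders chosen to match the `*Gnt*.zip` and `*Pot*.zip` patterns; they are not the real archive names.

```python
from pathlib import PurePosixPath

# Placeholder archive paths, for illustration only
all_names = {
    PurePosixPath('Character_Sample_Data/DemoGnt.zip'),
    PurePosixPath('Character_Sample_Data/DemoPot.zip'),
    PurePosixPath('Feature_Data/demo_trn.zip'),
}

# A relative pattern is matched against the end of the path
gnt_names = {p for p in all_names if p.match('*Gnt*.zip')}
pot_names = {p for p in all_names if p.match('*Pot*.zip')}

# Whatever is neither GNT nor POT is treated as an MPF archive
mpf_names = all_names - gnt_names - pot_names
print(mpf_names)
```

Set difference only works when both sides spell the same file identically, which is why every path here is built the same way.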

In [None]:
mpf_bunch = {}
for mpf_name in zip_mpf_names:
    mpf_bunch.update(zipfile2bunch(mpf_name))

Save as JSON:

In [None]:
%%time
json_path = 'data/features.json'
bunch2json(mpf_bunch, json_path)

Save as HDF5:

In [None]:
%%time
hdf_path = 'data/features.h5'
bunch2hdf(mpf_bunch, hdf_path)

Load the JSON file:

In [None]:
%%time
mpf_bunch = json2bunch(json_path)

Load the HDF5 file:

In [None]:
%%time
h5 = tb.open_file(hdf_path)

### Comparing file sizes again

In [None]:
from sys import getsizeof


# Note: getsizeof is shallow; it measures the container, not its contents
print(
    f"JSON: Python object size {getsizeof(mpf_bunch)/1e3} kB, file size {Path(json_path).stat().st_size/1e9} GB")
print(
    f"HDF5: Python object size {getsizeof(h5)} B, file size {Path(hdf_path).stat().st_size/1e9} GB")

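Judging by the load times and file sizes measured above, HDF5 is a better fit for this data than JSON or ZipFile, so the rest of this tutorial works only with the HDF5 file.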

## Parsing features.h5

In [None]:
h5.get_filesize()  # file size in bytes

In [None]:
nodes = h5.list_nodes('/')  # list all MPF groups

In [None]:
nodes[0]

In [None]:
len(nodes)  # number of MPF groups

In [None]:
data_iter = h5.iter_nodes('/')  # iterate over the MPF groups lazily

In [None]:
next(data_iter)  # fetch one MPF group

### Getting the feature matrix and labels of an MPF

In [None]:
mpf_name = 'HWDB1_0trn__007_mpf'
# Look up an MPF group by name
mpf = h5.get_node('/', mpf_name)
mpf

In [None]:
def get_features(mpf):
    '''Return the feature matrix of an MPF group.'''
    return mpf.features[:]


def get_labels(mpf):
    '''Return the label array of an MPF group.'''
    labels_str = mpf.labels.read().decode()
    return np.array(labels_str.split(' '))

In [None]:
features = get_features(mpf)  # feature matrix
labels = get_labels(mpf)      # labels
h5.close()

In [None]:
features

In [None]:
labels

## MPF iterator

Building on the feature-matrix and label functions above, we define an MPF iterator as follows:

In [None]:
class CASIAFeature:
    '''Helper for working with the MPF features of the CASIA data.'''

    def __init__(self, hdf_path):
        self.h5 = tb.open_file(hdf_path)

    def _features(self, mpf):
        '''Return the feature matrix of an MPF group.'''
        return mpf.features[:]

    def _labels(self, mpf):
        '''Return the label array of an MPF group.'''
        labels_str = mpf.labels.read().decode()
        return np.array(labels_str.split(' '))

    def __iter__(self):
        '''Yield (features, labels) for each MPF group.'''
        for mpf in self.h5.iter_nodes('/'):
            yield self._features(mpf), self._labels(mpf)

### Using the MPF iterator

In [None]:
mpf_iter = CASIAFeature(hdf_path)
# Fetch the data through the iterator
for features, labels in mpf_iter:
    print(features.shape, labels.shape)
    break
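
`CASIAFeature` opens the HDF5 file in its constructor and never closes it. One possible refinement, sketched below, is to add the context-manager protocol so that the resource is released when the loop is done. To stay runnable without PyTables, the sketch uses an in-memory dictionary as a stand-in for the file; `DemoFeatureStore` and its data are hypothetical.

```python
class DemoFeatureStore:
    """Stand-in for an HDF5-backed store: yields (features, labels) pairs."""

    def __init__(self, groups):
        self.groups = groups  # in a real class this would be an open file
        self.closed = False

    def __iter__(self):
        # Generator-based iteration: each group is produced on demand
        for features, labels in self.groups.values():
            yield features, labels

    def close(self):
        self.closed = True  # a real class would close the file here

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


groups = {
    'writer_a': ([[1, 2], [3, 4]], ['啊', '阿']),
    'writer_b': ([[5, 6]], ['埃']),
}

with DemoFeatureStore(groups) as store:
    shapes = [(len(features), len(labels)) for features, labels in store]

print(shapes, store.closed)
```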

In [None]:
h5

### Repacking CASIA to split it into training and test sets

Restart the kernel before running the cells below.

In [1]:
import sys
import os
sys.path.append(os.path.join(os.getcwd(), "loader"))

In [2]:
sys.path

['/Users/lincolnmac16/Documents/GitHub/crnn-pytorch/datasets',
 '/Users/lincolnmac16/Documents/GitHub/crnn-pytorch',
 '/Users/lincolnmac16/opt/anaconda3/envs/crnn-pytorch/lib/python310.zip',
 '/Users/lincolnmac16/opt/anaconda3/envs/crnn-pytorch/lib/python3.10',
 '/Users/lincolnmac16/opt/anaconda3/envs/crnn-pytorch/lib/python3.10/lib-dynload',
 '',
 '/Users/lincolnmac16/opt/anaconda3/envs/crnn-pytorch/lib/python3.10/site-packages',
 '/Users/lincolnmac16/Documents/GitHub/crnn-pytorch/datasets/loader']

In [3]:
from casia.feature import CASIA

![fLtgb9](https://oss.images.shujudaka.com/uPic/fLtgb9.png)

In [4]:
# Root directory of the CASIA datasets
root = 'datas_zips/'
save_path = 'data/features.h5'

casia = CASIA(root)  # this class splits the dataset into training and test sets
casia.bunch2hdf(save_path)

Train names->
{PosixPath('datas_zips/Feature_Data/HWDB1.1trn.zip'), PosixPath('datas_zips/Feature_Data/OLHWDB1.0trn.zip'), PosixPath('datas_zips/Feature_Data/OLHWDB1.1trn.zip'), PosixPath('datas_zips/Feature_Data/HWDB1.0trn.zip')}
Test  names->
{PosixPath('datas_zips/Feature_Data/OLHWDB1.1tst.zip'), PosixPath('datas_zips/Feature_Data/HWDB1.0tst.zip'), PosixPath('datas_zips/Feature_Data/OLHWDB1.0tst.zip'), PosixPath('datas_zips/Feature_Data/HWDB1.1tst.zip')}
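
The printed names suggest that archives ending in `trn` go to the training set and those ending in `tst` go to the test set. The sketch below reproduces that split from the file names alone. It is an assumption about the naming convention, not the actual logic inside `CASIA`, and `split_by_suffix` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def split_by_suffix(paths):
    """Split archive paths into (train, test) by the end of the file stem."""
    train = {p for p in paths if p.stem.endswith('trn')}
    test = {p for p in paths if p.stem.endswith('tst')}
    return train, test


paths = {
    PurePosixPath('datas_zips/Feature_Data/HWDB1.0trn.zip'),
    PurePosixPath('datas_zips/Feature_Data/HWDB1.0tst.zip'),
    PurePosixPath('datas_zips/Feature_Data/OLHWDB1.1trn.zip'),
    PurePosixPath('datas_zips/Feature_Data/OLHWDB1.1tst.zip'),
}

train_names, test_names = split_by_suffix(paths)
print(sorted(p.name for p in train_names))
print(sorted(p.name for p in test_names))
```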


In [5]:
%%time
# Load the HDF5 file
import tables as tb
h5 = tb.open_file(save_path)

CPU times: user 1.59 ms, sys: 1.16 ms, total: 2.75 ms
Wall time: 2.66 ms


In [6]:
h5.root

/ (RootGroup) "Xinet's casia dataset"
  children := ['test' (Group), 'train' (Group)]

In [7]:
from casia.feature import CASIAFeature
mpf_dataset = CASIAFeature(save_path)
# Iterate over the test set
for features, labels in mpf_dataset.test_iter():
    print(features.shape, labels.shape)
    break

(3726, 512) (3726,)
