Merged
57 commits
a73846d
Create conf.py
BAOOOOOM Aug 21, 2021
f159165
Add files via upload
BAOOOOOM Aug 21, 2021
d890696
Delete sif.png
BAOOOOOM Aug 21, 2021
eff6864
Add files via upload
BAOOOOOM Aug 21, 2021
276d431
Create conf.py
BAOOOOOM Aug 21, 2021
39c8ce6
Create conf.py
BAOOOOOM Aug 21, 2021
7127957
Create index.rst
BAOOOOOM Aug 21, 2021
edc8206
Create index.rst
BAOOOOOM Aug 21, 2021
b119b30
Add files via upload
BAOOOOOM Aug 21, 2021
949f9a2
Create parse.rst
BAOOOOOM Aug 21, 2021
f03b7e8
Create parse.rst
BAOOOOOM Aug 21, 2021
9121855
Create seg.rst
BAOOOOOM Aug 21, 2021
6c4e681
Create sif.rst
BAOOOOOM Aug 21, 2021
11e8627
Create sif.rst
BAOOOOOM Aug 21, 2021
03c2e6c
Create sif.rst
BAOOOOOM Aug 21, 2021
6ef1df4
Create sif.rst
BAOOOOOM Aug 21, 2021
275f0a0
Create sif.rst
BAOOOOOM Aug 21, 2021
6d6e029
Create sif.rst
BAOOOOOM Aug 21, 2021
1d4eb0d
Create sif.rst
BAOOOOOM Aug 21, 2021
ba1febb
Create sif.rst
BAOOOOOM Aug 21, 2021
ffb6e29
Create sif.rst
BAOOOOOM Aug 21, 2021
57c9811
Create sif.rst
BAOOOOOM Aug 21, 2021
d40a561
Create sif.rst
BAOOOOOM Aug 21, 2021
3c44eb4
Create sif.rst
BAOOOOOM Aug 21, 2021
9ee3a34
Create sif.rst
BAOOOOOM Aug 21, 2021
8456024
Create sif.rst
BAOOOOOM Aug 21, 2021
20aa133
Create sif.rst
BAOOOOOM Aug 21, 2021
64898b6
Create sif.rst
BAOOOOOM Aug 21, 2021
2668cf2
Create sif.rst
BAOOOOOM Aug 21, 2021
98988b9
Add files via upload
BAOOOOOM Aug 21, 2021
90e120b
Create 分词.rst
BAOOOOOM Aug 21, 2021
6533fd4
Create 分句.rst
BAOOOOOM Aug 21, 2021
ec57a31
Create 令牌化.rst
BAOOOOOM Aug 21, 2021
f964054
Create 令牌化.rst
BAOOOOOM Aug 21, 2021
824fe92
Create 令牌化.rst
BAOOOOOM Aug 21, 2021
fad5a72
Create 令牌化.rst
BAOOOOOM Aug 21, 2021
415b264
Create tokenize.rst
BAOOOOOM Aug 21, 2021
8011eb4
Create tokenize.rst
BAOOOOOM Aug 21, 2021
7f7d05f
Create tokenize.rst
BAOOOOOM Aug 21, 2021
7677c94
Add files via upload
BAOOOOOM Aug 21, 2021
b4fc768
Create 不使用预训练模型.txt
BAOOOOOM Aug 21, 2021
5feda19
Create 不使用预训练模型.txt
BAOOOOOM Aug 21, 2021
a32a7cd
Create 使用预训练模型.txt
BAOOOOOM Aug 21, 2021
e819706
Delete docs/source/tutorial/zh/vectorization directory
BAOOOOOM Aug 21, 2021
96e2799
Add files via upload
BAOOOOOM Aug 21, 2021
b96e1e2
Create vectorization.rst
BAOOOOOM Aug 21, 2021
a9e444e
Add files via upload
BAOOOOOM Aug 21, 2021
68bddc0
Create start.rst
BAOOOOOM Aug 21, 2021
2f9cc0c
Create loading.rst
BAOOOOOM Aug 21, 2021
b6101ab
Create pub.rst
BAOOOOOM Aug 21, 2021
c392a82
Create pretrain.rst
BAOOOOOM Aug 21, 2021
1a363f6
Create vectorization.rst
BAOOOOOM Aug 21, 2021
b42145f
Create vectorization.rst
BAOOOOOM Aug 21, 2021
1146873
Delete docs/source/tutorial/zh/vectorization directory
BAOOOOOM Aug 21, 2021
22bcdbb
Add files via upload
BAOOOOOM Aug 21, 2021
a2af455
Create 不使用预训练模型.rst
BAOOOOOM Aug 21, 2021
958a0e2
Create 使用预训练模型.rst
BAOOOOOM Aug 21, 2021
Binary file added asset/_static/d2v.png
Binary file added asset/_static/d2v_bow_tfidf.png
Binary file added asset/_static/d2v_general.png
Binary file added asset/_static/d2v_stem_tf.png
Binary file added asset/_static/sif.png
Binary file added asset/_static/w2v_stem_text.png
Binary file added asset/_static/w2v_stem_tf.png
12 changes: 10 additions & 2 deletions docs/source/conf.py
@@ -56,16 +56,24 @@ def copy_tree(src, tar):
# npsphinx

nbsphinx_thumbnails = {
'build/blitz/sif/sif': '_static/item_figure.png',
'build/blitz/sif/sif': '_static/sif.png',
'build/blitz/sif/sif_addition': '_static/sif_addition.png',
'build/blitz/utils/data': '_static/data.png',
'build/blitz/formula/formula': '_static/formula.png',
'build/blitz/seg/seg': '_static/seg.png',
'build/blitz/parse/parse': '_static/parse.png',
'build/blitz/formula/formula': '_static/formula.png',
'build/blitz/tokenizer/tokenizer': '_static/tokenizer.png',
'build/blitz/pretrain/prepare_dataset': '_static/prepare_dataset.jpg',
'build/blitz/vectorization/i2v': '_static/i2v.png',
'build/blitz/pretrain/prepare_dataset': '_static/prepare_dataset.jpg',
'build/blitz/pretrain/gensim/d2v_bow_tfidf': '_static/d2v_bow_tfidf.png',
'build/blitz/pretrain/gensim/d2v_general': '_static/d2v_general.png',
'build/blitz/pretrain/gensim/d2v_stem_tf': '_static/d2v_stem_tf.png',
'build/blitz/pretrain/gensim/w2v_stem_text': '_static/w2v_stem_text.png',
'build/blitz/pretrain/gensim/w2v_stem_tf': '_static/w2v_stem_tf.png',
'build/blitz/pretrain/seg_token/d2v': '_static/d2v.png',
'build/blitz/pretrain/seg_token/d2v_d1': '_static/d2v_d1.png',
'build/blitz/pretrain/seg_token/d2v_d2': '_static/d2v_d2.png',
}

# Add any paths that contain templates here, relative to this directory.
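A mapping like `nbsphinx_thumbnails` is easy to get out of sync with the image files that are actually committed. The sketch below is an illustration, not part of this change: `missing_thumbnails` is a hypothetical helper, and it assumes the thumbnail paths resolve relative to a root directory that you pass in.

```python
import tempfile
from pathlib import Path


def missing_thumbnails(thumbnails, root_dir):
    """Return the notebook keys whose thumbnail file is absent under root_dir."""
    root = Path(root_dir)
    return sorted(
        key for key, rel_path in thumbnails.items()
        if not (root / rel_path).is_file()
    )


# Demo on a throwaway directory that contains only one of the two images.
with tempfile.TemporaryDirectory() as tmp:
    static_dir = Path(tmp) / "_static"
    static_dir.mkdir()
    (static_dir / "sif.png").touch()

    demo_thumbnails = {
        "build/blitz/sif/sif": "_static/sif.png",
        "build/blitz/pretrain/seg_token/d2v_d1": "_static/d2v_d1.png",
    }
    missing = missing_thumbnails(demo_thumbnails, tmp)

print(missing)
```

Running a check of this kind before building the docs would flag entries whose images were never uploaded.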
4 changes: 1 addition & 3 deletions docs/source/tutorial/zh/index.rst
@@ -120,8 +120,8 @@ gensim d2v model examples
:name: rst2-gallery
:glob:

d2v_bow_tfidf <../../build/blitz/pretrain/gensim/d2v_bow_tfidf.ipynb>
d2v_general <../../build/blitz/pretrain/gensim/d2v_general.ipynb>
d2v_bow_tfidf <../../build/blitz/pretrain/gensim/d2v_bow_tfidf.ipynb>
d2v_stem_tf <../../build/blitz/pretrain/gensim/d2v_stem_tf.ipynb>


@@ -146,5 +146,3 @@ seg_token examples
:glob:

d2v.ipynb <../../build/blitz/pretrain/seg_token/d2v.ipynb>
d2v_d1 <../../build/blitz/pretrain/seg_token/d2v_d1.ipynb>
d2v_d2 <../../build/blitz/pretrain/seg_token/d2v_d2.ipynb>
1 change: 1 addition & 0 deletions docs/source/tutorial/zh/parse.rst
@@ -28,6 +28,7 @@
--------------------

.. toctree::
:maxdepth: 1
:titlesonly:

Text syntax structure parsing <parse/文本语法结构解析>
122 changes: 5 additions & 117 deletions docs/source/tutorial/zh/pretrain.rst
@@ -8,125 +8,13 @@
* How to load a pretrained model
* Publicly available pretrained models


Training a model
----------------

Basic steps
##################

1. Determine the model type and choose a suitable Tokenizer (GensimWordTokenizer or GensimSegTokenizer) to tokenize the items;

2. Call the train_vector function to obtain the desired pretrained model.

Examples:

::

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']

# 10 dimensions with the d2v method
train_vector(sif_items, "../../../data/w2v/gensim_luna_stem_tf_", 10, method="d2v")

Loading a model
---------------

Pass the trained model to the I2V module to load it.

Examples:

::

>>> model_path = "../test_model/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text", "d2v", filepath=model_path, pretrained_t2v=False)


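Overview of public models
-------------------------

Version notes
##################

First-level version:

* Public version 1 (luna_pub): college entrance examination (Gaokao) questions
* Public version 2 (luna_pub_large): Gaokao questions + regional exam questions

Second-level version:

* Single subjects (Chinese, Math, English, History, Geography, Politics, Biology, Physics, Chemistry)
* Subject groups (sciences: science; liberal arts: literal; all subjects: all)

Third-level version (to be completed):

* Without a third-party vocabulary for initialization
* With a third-party vocabulary for initialization



Model naming rule: first-level version + second-level version + gensim_luna_stem + tokenization rule + model method + dimension

Examples:

::

Path of the D2V model for the full version, all subjects:
`/share/qlh/d2v_model/luna_pub/luna_pub_all_gensim_luna_stem_general_d2v_256.bin`
(Note: one D2V model consists of 4 files with the .bin suffix)

Notes on the training data
##########################

* The current word-vector (w2v) and sentence-vector (d2v) models are all trained on high-school-level questions
* Test data: `[OpenLUNA.json] <http://base.ustc.edu.cn/data/OpenLUNA/OpenLUNA.json>`_

The following models are currently provided; more subject-specific and question-type-specific models are being trained.
"d2v_all_256" (all subjects), "d2v_sci_256" (science), "d2v_lit_256" (liberal arts), "d2v_eng_256" (English)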

Model training examples
-----------------------

Getting the dataset
####################

.. toctree::
:maxdepth: 1
:titlesonly:

prepare_dataset <../../build/blitz/pretrain/prepare_dataset.ipynb>

gensim d2v model examples
#########################

.. toctree::
:maxdepth: 1
:titlesonly:

d2v_bow_tfidf <../../build/blitz/pretrain/gensim/d2v_bow_tfidf.ipynb>
d2v_general <../../build/blitz/pretrain/gensim/d2v_general.ipynb>
d2v_stem_tf <../../build/blitz/pretrain/gensim/d2v_stem_tf.ipynb>

gensim w2v model examples
#########################

.. toctree::
:maxdepth: 1
:titlesonly:

w2v_stem_text <../../build/blitz/pretrain/gensim/w2v_stem_text.ipynb>
w2v_stem_tf <../../build/blitz/pretrain/gensim/w2v_stem_tf.ipynb>

seg_token examples
####################
Learning roadmap
------------------

.. toctree::
:maxdepth: 1
:titlesonly:

d2v.ipynb <../../build/blitz/pretrain/seg_token/d2v.ipynb>
d2v_d1 <../../build/blitz/pretrain/seg_token/d2v_d1.ipynb>
d2v_d2 <../../build/blitz/pretrain/seg_token/d2v_d2.ipynb>
Training a model <pretrain/start.rst>
Loading a model <pretrain/loading.rst>
Overview of public models <pretrain/pub.rst>
11 changes: 11 additions & 0 deletions docs/source/tutorial/zh/pretrain/loading.rst
@@ -0,0 +1,11 @@
Loading a model
---------------

Pass the trained model to the I2V module to load it.

Examples:

::

>>> model_path = "../test_model/test_gensim_luna_stem_tf_d2v_256.bin"
>>> i2v = D2V("text", "d2v", filepath=model_path, pretrained_t2v=False)
85 changes: 85 additions & 0 deletions docs/source/tutorial/zh/pretrain/pub.rst
@@ -0,0 +1,85 @@
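Overview of public models
-------------------------

Version notes
##################

First-level version:

* Public version 1 (luna_pub): college entrance examination (Gaokao) questions
* Public version 2 (luna_pub_large): Gaokao questions + regional exam questions

Second-level version:

* Single subjects (Chinese, Math, English, History, Geography, Politics, Biology, Physics, Chemistry)
* Subject groups (sciences: science; liberal arts: literal; all subjects: all)

Third-level version (to be completed):

* Without a third-party vocabulary for initialization
* With a third-party vocabulary for initialization



Model naming rule: first-level version + second-level version + gensim_luna_stem + tokenization rule + model method + dimension

Examples:

::

Path of the D2V model for the full version, all subjects:
`/share/qlh/d2v_model/luna_pub/luna_pub_all_gensim_luna_stem_general_d2v_256.bin`
(Note: one D2V model consists of 4 files with the .bin suffix)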
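
The naming rule above can be sketched as a small helper. This is an illustration only: `model_filename` and its parameter names are hypothetical, and it assumes the parts are joined with underscores and end in `.bin`, as in the example path.

```python
def model_filename(level1, level2, tokenization, method, dim, ext=".bin"):
    """Compose a model file name from the parts listed in the naming rule.

    The fixed "gensim_luna_stem" segment and the underscore separator are
    taken from the example path; everything else is supplied by the caller.
    """
    parts = [level1, level2, "gensim_luna_stem", tokenization, method, str(dim)]
    return "_".join(parts) + ext


# Full version (luna_pub), all subjects, "general" tokenization, d2v, 256 dims.
name = model_filename("luna_pub", "all", "general", "d2v", 256)
print(name)  # luna_pub_all_gensim_luna_stem_general_d2v_256.bin
```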

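Notes on the training data
##########################

* The current word-vector (w2v) and sentence-vector (d2v) models are all trained on high-school-level questions
* Test data: `[OpenLUNA.json] <http://base.ustc.edu.cn/data/OpenLUNA/OpenLUNA.json>`_

The following models are currently provided; more subject-specific and question-type-specific models are being trained.
"d2v_all_256" (all subjects), "d2v_sci_256" (science), "d2v_lit_256" (liberal arts), "d2v_eng_256" (English)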

Model training examples
-----------------------

Getting the dataset
####################

.. toctree::
:maxdepth: 1
:titlesonly:

prepare_dataset <../../../build/blitz/pretrain/prepare_dataset.ipynb>

gensim d2v model examples
#########################

.. toctree::
:maxdepth: 1
:titlesonly:

d2v_bow_tfidf <../../../build/blitz/pretrain/gensim/d2v_bow_tfidf.ipynb>
d2v_general <../../../build/blitz/pretrain/gensim/d2v_general.ipynb>
d2v_stem_tf <../../../build/blitz/pretrain/gensim/d2v_stem_tf.ipynb>

gensim w2v model examples
#########################

.. toctree::
:maxdepth: 1
:titlesonly:

w2v_stem_text <../../../build/blitz/pretrain/gensim/w2v_stem_text.ipynb>
w2v_stem_tf <../../../build/blitz/pretrain/gensim/w2v_stem_tf.ipynb>

seg_token examples
####################

.. toctree::
:maxdepth: 1
:titlesonly:

d2v.ipynb <../../../build/blitz/pretrain/seg_token/d2v.ipynb>
d2v_d1 <../../../build/blitz/pretrain/seg_token/d2v_d1.ipynb>
d2v_d2 <../../../build/blitz/pretrain/seg_token/d2v_d2.ipynb>
22 changes: 22 additions & 0 deletions docs/source/tutorial/zh/pretrain/start.rst
@@ -0,0 +1,22 @@
Training a model
----------------

Basic steps
##################

1. Determine the model type and choose a suitable Tokenizer (GensimWordTokenizer or GensimSegTokenizer) to tokenize the items;

2. Call the train_vector function to obtain the desired pretrained model.

Examples:

::

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']

# 10 dimensions with the d2v method
train_vector(sif_items, "../../../data/w2v/gensim_luna_stem_tf_", 10, method="d2v")
1 change: 1 addition & 0 deletions docs/source/tutorial/zh/seg.rst
@@ -18,6 +18,7 @@
--------------------

.. toctree::
:maxdepth: 1
:titlesonly:

Semantic component segmentation <seg/语义成分分解>