Merged
Commits
30 commits
87b2b63
Create 令牌化.rst
BAOOOOOM Aug 23, 2021
4816fa4
Create PureTextTokenizer.ipynb
BAOOOOOM Aug 23, 2021
d46aa06
Create 令牌化.rst
BAOOOOOM Aug 23, 2021
bc8db6a
Create PureTextTokenizer.ipynb
BAOOOOOM Aug 23, 2021
1d05290
Create pretrain.rst
BAOOOOOM Aug 23, 2021
ac3020d
Delete docs/source/ap directory
BAOOOOOM Aug 23, 2021
53c5da6
Create Pretrain.rst
BAOOOOOM Aug 23, 2021
13d1ace
Create PureTextTokenizer.rst
BAOOOOOM Aug 23, 2021
9fd1212
Create pretrain.rst
BAOOOOOM Aug 23, 2021
2798fc5
Create index.rst
BAOOOOOM Aug 23, 2021
c309caa
Create ModelZoo.rst
BAOOOOOM Aug 23, 2021
a5eac35
Create index.rst
BAOOOOOM Aug 23, 2021
ff588d8
Create tokenizer.rst
BAOOOOOM Aug 23, 2021
4d32acd
Create vector.rst
BAOOOOOM Aug 23, 2021
e9a117e
Create utils.rst
BAOOOOOM Aug 23, 2021
6fd3843
Create pretrain.rst
BAOOOOOM Aug 23, 2021
bd2f7f1
Create index.rst
BAOOOOOM Aug 23, 2021
58423c2
Create pretrain.rst
BAOOOOOM Aug 23, 2021
c8dfd43
Create index.rst
BAOOOOOM Aug 23, 2021
7d6528e
Create pretrain.rst
BAOOOOOM Aug 23, 2021
cca5870
Create ModelZoo.rst
BAOOOOOM Aug 23, 2021
f8f0166
Create pretrain.rst
BAOOOOOM Aug 23, 2021
9e0d355
Create index.rst
BAOOOOOM Aug 23, 2021
6eb9e7c
Create gensim_vec.py
BAOOOOOM Aug 23, 2021
4c968c3
Create gensim_vec.py
BAOOOOOM Aug 23, 2021
24cc07d
Create gensim_vec.py
BAOOOOOM Aug 23, 2021
c7d4514
Create 语义成分分解.rst
BAOOOOOM Aug 23, 2021
9cd6f06
Create d2v_bow_tfidf.ipynb
BAOOOOOM Aug 23, 2021
5dd95d2
Create w2v_stem_text.ipynb
BAOOOOOM Aug 23, 2021
f55a876
Create sif.ipynb
BAOOOOOM Aug 23, 2021
12 changes: 6 additions & 6 deletions EduNLP/Pretrain/gensim_vec.py
@@ -15,8 +15,7 @@


class GensimWordTokenizer(object):
def __init__(self, symbol="gm", general=False):
"""
"""

Parameters
----------
@@ -44,7 +43,8 @@ def __init__(self, symbol="gm", general=False):
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
"""
"""
def __init__(self, symbol="gm", general=False):
self.symbol = symbol
if general is True:
self.tokenization_params = {
@@ -72,15 +72,15 @@ def __call__(self, item):


class GensimSegTokenizer(object): # pragma: no cover
def __init__(self, symbol="gms", depth=None, flatten=False, **kwargs):
"""
"""

Parameters
----------
symbol:
gms
fgm
"""
"""
def __init__(self, symbol="gms", depth=None, flatten=False, **kwargs):
self.symbol = symbol
self.tokenization_params = {
"formula_params": {
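The docstring relocated in this file carries the GensimWordTokenizer doctest shown above. A minimal usage sketch based on that doctest — the import path EduNLP.Pretrain and the general=True flag are assumptions, not shown in this diff:

::

    # Sketch based on the doctest in the relocated docstring; assumes
    # GensimWordTokenizer is exported from EduNLP.Pretrain and that
    # general=True matches the doctest's setup (an assumption).
    from EduNLP.Pretrain import GensimWordTokenizer

    tokenizer = GensimWordTokenizer(symbol="gm", general=True)
    token_item = tokenizer(
        "有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,"
        "若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,"
        "则$z=x+7 y$的最大值为$\\SIFBlank$"
    )
    # Per the doctest, the leading tokens resemble:
    # ['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', ...]
    print(token_item.tokens[:10])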
16 changes: 16 additions & 0 deletions docs/source/api/ModelZoo.rst
@@ -0,0 +1,16 @@
EduNLP.ModelZoo
==============

rnn
-----------

.. automodule:: EduNLP.ModelZoo.rnn
:members:
:imported-members:

utils
-----------

.. automodule:: EduNLP.ModelZoo.utils
:members:
:imported-members:
41 changes: 41 additions & 0 deletions docs/source/api/index.rst
@@ -1,2 +1,43 @@
EduNLP
======

SIF
----------------------
.. automodule:: EduNLP.SIF.sif
:members:
:imported-members:

EduNLP.Formula
---------------------

.. automodule:: EduNLP.Formula.ast
:members:
:imported-members:

EduNLP.I2V
-----------------

.. automodule:: EduNLP.I2V.i2v
:members:
:imported-members:

EduNLP.Pretrain
-------------------

.. automodule:: EduNLP.Pretrain
:members:
:imported-members:

EduNLP.Tokenizer
----------------------

.. automodule:: EduNLP.Tokenizer
:members:
:imported-members:

Vector
---------------

.. automodule:: EduNLP.Vector
:members:
:imported-members:
6 changes: 6 additions & 0 deletions docs/source/api/pretrain.rst
@@ -0,0 +1,6 @@
EduNLP.Pretrain
==================

.. automodule:: EduNLP.Pretrain
:members:
:imported-members:
6 changes: 6 additions & 0 deletions docs/source/api/tokenizer.rst
@@ -0,0 +1,6 @@
EduNLP.Tokenizer
=====================================

.. automodule:: EduNLP.Tokenizer
:members:
:imported-members:
6 changes: 6 additions & 0 deletions docs/source/api/utils.rst
@@ -0,0 +1,6 @@
EduNLP.utils
====================

.. automodule:: EduNLP.utils
:members:
:imported-members:
16 changes: 16 additions & 0 deletions docs/source/api/vector.rst
@@ -0,0 +1,16 @@
EduNLP.Vector
==========================

Vector
---------------

.. automodule:: EduNLP.Vector
:members:
:imported-members:

rnn
-----------

.. automodule:: EduNLP.Vector.rnn
:members:
:imported-members:
6 changes: 6 additions & 0 deletions docs/source/index.rst
@@ -167,4 +167,10 @@ If this repository is helpful for you, please cite our work
api/index
api/i2v
api/sif
api/tokenizer
api/formula
api/pretrain
api/ModelZoo
api/vector
api/utils

7 changes: 0 additions & 7 deletions docs/source/tutorial/zh/seg/语义成分分解.rst
@@ -46,10 +46,3 @@
>>> dict2str4sif(item, key_as_tag=False)
'若复数$z=1+2 i+i^{3}$,则$|z|=$0$\\SIFSep$1$\\SIFSep$$\\sqrt{2}$$\\SIFSep$2'

详细示范
++++++++++++++++++++++

.. toctree::
:titlesonly:

语义成分分解的案例 <../../../build/blitz/utils/data.ipynb>
31 changes: 31 additions & 0 deletions docs/source/tutorial/zh/tokenization/PureTextTokenizer.rst
@@ -0,0 +1,31 @@
PureTextTokenizer
================

The pure-text tokenizer. By default, figures, tags, separators, and question-blank markers in the incoming item are converted to special tokens for protection, while special formulas (e.g. $\\FormFigureID{...}$, $\\FormFigureBase64{...}$) are filtered out, so that the text and plain-text formulas are what get tokenized. Both text and formulas are analyzed linearly; the key parameter is provided to preprocess the incoming item, with further development to follow as requirements arise.

Examples
----------

::

>>> tokenizer = PureTextTokenizer()
>>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
>>> tokens = tokenizer(items)
>>> next(tokens)[:10]
['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
>>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
>>> tokens = tokenizer(items)
>>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
>>> items = [{
... "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
... "options": ["1", "2"]
... }]
>>> tokens = tokenizer(items, key=lambda x: x["stem"])
>>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
'0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
'\\quad', 'A', '\\cap', 'B', '=']
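The doctest in this new tutorial page relies on an import it does not show. A self-contained sketch of the same calls — the import path EduNLP.Tokenizer is an assumption, inferred from the ./EduNLP/Tokenizer/tokenizer.py path cited in 令牌化.rst below:

::

    # Self-contained version of the doctest above; the import path is an
    # assumption inferred from ./EduNLP/Tokenizer/tokenizer.py.
    from EduNLP.Tokenizer import PureTextTokenizer

    tokenizer = PureTextTokenizer()
    items = [
        "有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,"
        "若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,"
        "则$z=x+7 y$的最大值为$\\SIFBlank$"
    ]
    tokens = tokenizer(items)          # yields one token list per input item
    print(next(tokens)[:10])
    # ['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']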
3 changes: 2 additions & 1 deletion docs/source/tutorial/zh/tokenize/令牌化.rst
@@ -17,12 +17,13 @@ Examples



More tokenizers can be found by looking at "./EduNLP/Tokenizer/tokenizer.py" and "./EduNLP/Pretrain/gensim_vec.py"; below is a complete list of tokenizers
More tokenizers can be found by looking at ``./EduNLP/Tokenizer/tokenizer.py`` and ``./EduNLP/Pretrain/gensim_vec.py``; below is a complete list of tokenizers

.. toctree::
:maxdepth: 1
:titlesonly:

../tokenization/TextTokenizer
../tokenization/PureTextTokenizer
../tokenization/GensimSegTokenizer
../tokenization/GensimWordTokenizer
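The toctree above lists four tokenizer pages. The two linear text tokenizers share the same call pattern and can be swapped; an illustrative sketch — assuming TextTokenizer and PureTextTokenizer are both exported from EduNLP.Tokenizer, and without asserting their exact outputs:

::

    # Illustrative only; assumes TextTokenizer and PureTextTokenizer are both
    # exported from EduNLP.Tokenizer. See the doctests in the pages above for
    # the outputs each one actually produces.
    from EduNLP.Tokenizer import TextTokenizer, PureTextTokenizer

    items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, "
             "\\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]

    for cls in (TextTokenizer, PureTextTokenizer):
        tokens = next(cls()(items))    # each tokenizer yields one token list per item
        print(cls.__name__, tokens[:10])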