Merged
8 changes: 8 additions & 0 deletions EduNLP/I2V/i2v.py
@@ -260,11 +260,19 @@ def from_pretrained(cls, name, model_dir=MODEL_DIR, *args, **kwargs):

def get_pretrained_i2v(name, model_dir=MODEL_DIR):
"""
Use this function to conveniently convert an item to a vector with a pretrained model.

Parameters
-----------
name: str
the name of item2vector model
e.g.:
d2v_all_256
d2v_sci_256
d2v_eng_256
d2v_lit_256
w2v_sci_300
w2v_lit_300
model_dir:str
the path of model, default: MODEL_DIR = '~/.EduNLP/model'

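The pretrained model names listed in the docstring follow a `method_subject_dimension` convention (e.g. `d2v_sci_256`). A small sketch of decoding that convention — the helper below is hypothetical, not part of EduNLP, and the convention is inferred from the docstring examples:

```python
def parse_model_name(name):
    """Split a pretrained-model name such as 'd2v_sci_256' into its parts.

    The method_subject_dimension convention is inferred from the
    examples listed in the get_pretrained_i2v docstring.
    """
    method, subject, dim = name.split("_")
    return method, subject, int(dim)

parse_model_name("w2v_lit_300")  # → ('w2v', 'lit', 300)
```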
55 changes: 26 additions & 29 deletions EduNLP/Tokenizer/tokenizer.py
@@ -15,6 +15,19 @@ def __call__(self, *args, **kwargs):

class PureTextTokenizer(Tokenizer):
r"""
Deal with text and plain text formulas,
filtering out special formulas such as $\\FormFigureID{…}$ and $\\FormFigureBase64{…}$.

Parameters
----------
items : Iterable
    the items to tokenize
key : callable
    maps an item to the text to tokenize, defaults to the identity function
args
kwargs

Returns
-------
tokens : generator
    yields a token list for each item

Examples
--------
@@ -49,27 +62,24 @@ def __init__(self, *args, **kwargs):
}

def __call__(self, items: Iterable, key=lambda x: x, *args, **kwargs):
"""
Deal with text and plain text formulas,
filtering out special formulas such as $\\FormFigureID{…}$ and $\\FormFigureBase64{…}$.

Parameters
----------
items : Iterable
    the items to tokenize
key : callable
    maps an item to the text to tokenize, defaults to the identity function
args
kwargs

Returns
-------
tokens : generator
    yields a token list for each item
"""
for item in items:
yield tokenize(seg(key(item), symbol="gmas"), **self.tokenization_params).tokens

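Note that `__call__` is lazy: it yields one token list per item rather than materializing them all. The same generator pattern can be sketched in plain Python — the class below is an illustrative stand-in, not EduNLP's tokenizer:

```python
import re

class MiniTokenizer:
    """Illustrative stand-in for PureTextTokenizer.__call__'s lazy pattern:
    yields one token list per item instead of building them all up front."""

    def __call__(self, items, key=lambda x: x):
        for item in items:
            # drop $\FormFigureID{...}$ / $\FormFigureBase64{...}$ special formulas
            text = re.sub(r"\$\\FormFigure(?:ID|Base64)\{[^}]*\}\$", "", key(item))
            yield re.findall(r"\w+", text)

tokens = list(MiniTokenizer()([r"As shown in $\FormFigureID{0}$ the figure", "x equals 2"]))
# tokens[0] == ['As', 'shown', 'in', 'the', 'figure']
```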

class TextTokenizer(Tokenizer):
r"""
Deal with text and formulas, including special formulas.

Parameters
----------
items : Iterable
    the items to tokenize
key : callable
    maps an item to the text to tokenize, defaults to the identity function
args
kwargs

Returns
-------
tokens : generator
    yields a token list for each item

Examples
--------
@@ -95,20 +105,6 @@ def __init__(self, *args, **kwargs):
}

def __call__(self, items: Iterable, key=lambda x: x, *args, **kwargs):
"""
Deal with text and formulas, including special formulas.

Parameters
----------
items : Iterable
    the items to tokenize
key : callable
    maps an item to the text to tokenize, defaults to the identity function
args
kwargs

Returns
-------
tokens : generator
    yields a token list for each item
"""
for item in items:
yield tokenize(seg(key(item), symbol="gmas"), **self.tokenization_params).tokens

@@ -121,6 +117,7 @@ def __call__(self, items: Iterable, key=lambda x: x, *args, **kwargs):

def get_tokenizer(name, *args, **kwargs):
r"""
A unified interface for obtaining different tokenizers by name.

Parameters
----------
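The name-to-class dispatch that `get_tokenizer` performs can be sketched as a simple registry — a simplified stand-in; the real registry and its keys live in EduNLP/Tokenizer/tokenizer.py:

```python
class PureTextTokenizer:
    """stand-in for the real class"""

class TextTokenizer:
    """stand-in for the real class"""

# hypothetical registry mapping names to tokenizer classes
TOKENIZER = {"pure_text": PureTextTokenizer, "text": TextTokenizer}

def get_tokenizer(name, *args, **kwargs):
    # look the class up by name and instantiate it with the caller's arguments
    if name not in TOKENIZER:
        raise KeyError("unknown tokenizer: %r" % name)
    return TOKENIZER[name](*args, **kwargs)

tokenizer = get_tokenizer("text")  # an instance of TextTokenizer
```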
1 change: 1 addition & 0 deletions EduNLP/Vector/gensim_vec.py
@@ -12,6 +12,7 @@

class W2V(Vector):
"""
This class uses the gensim library (FastText, Word2Vec, and KeyedVectors) to convert words to vectors.

Parameters
----------
3 changes: 3 additions & 0 deletions EduNLP/Vector/t2v.py
@@ -21,6 +21,8 @@

class T2V(object):
"""
Convert a token list to a vector. If you already have a trained model, you can use T2V directly. \
Otherwise, calling get_pretrained_t2v is the better way: it obtains vectors from a pretrained model, so you can switch models without training your own.

Parameters
----------
@@ -73,6 +75,7 @@ def vector_size(self) -> int:
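The division of labor the docstring describes — wrap an already-loaded model in T2V, or fetch a pretrained one — can be sketched with a toy model. The averaging strategy below is an assumption for illustration, not necessarily what EduNLP's T2V does:

```python
class T2VSketch:
    """Toy stand-in for T2V: wraps an already-loaded word-vector model."""

    def __init__(self, model):
        self.model = model  # any mapping from token to vector

    def __call__(self, token_lists):
        # one vector per token list: the element-wise mean of its word vectors
        return [
            [sum(col) / len(col) for col in zip(*(self.model[t] for t in tokens))]
            for tokens in token_lists
        ]

toy_model = {"x": [1.0, 0.0], "y": [0.0, 1.0]}
T2VSketch(toy_model)([["x", "y"]])  # → [[0.5, 0.5]]
```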

def get_pretrained_t2v(name, model_dir=MODEL_DIR):
"""
Use this function to conveniently convert a token list to a vector with a pretrained model.

Parameters
----------
2 changes: 2 additions & 0 deletions EduNLP/utils/data.py
@@ -11,6 +11,7 @@

@contextmanager
def add_annotation(key, tag_mode, tar: list, key_as_tag=True):
"""add tag"""
if key_as_tag is True:
if tag_mode == "delimiter":
tar.append(ann_begin_format.format(key))
@@ -26,6 +27,7 @@ def add_annotation(key, tag_mode, tar: list, key_as_tag=True):

def dict2str4sif(obj: dict, key_as_tag=True, tag_mode="delimiter", add_list_no_tag=True, keys=None) -> str:
r"""
Convert an item from dictionary format to string (SIF) format.

Parameters
----------
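A minimal sketch of the dict-to-string idea: each field's text is wrapped in tags derived from its key. The `$\SIFTag{...}$` marker format and the `_end` suffix are assumptions for illustration; the real markers are defined in EduNLP/utils/data.py:

```python
def dict2str4sif_sketch(obj, key_as_tag=True):
    """Hypothetical simplification of dict2str4sif: concatenate each field's
    text, optionally bracketed by tags built from its key (format assumed)."""
    parts = []
    for key, value in obj.items():
        if key_as_tag:
            parts.append("$\\SIFTag{%s}$" % key)
        parts.append(str(value))
        if key_as_tag:
            parts.append("$\\SIFTag{%s_end}$" % key)
    return "".join(parts)

dict2str4sif_sketch({"stem": "1+1=(  )"})
# → '$\SIFTag{stem}$1+1=(  )$\SIFTag{stem_end}$'
```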
8 changes: 4 additions & 4 deletions docs/source/tutorial/zh/index.rst
@@ -18,14 +18,14 @@

.. figure:: ../../_static/新流程图.png

* **Syntax parsing**: converts the input item into standard SIF format (i.e., wraps letters and digits in ``$...$``, and turns the brackets of choice blanks and the underscores of fill-in blanks into special symbols).
* `Syntax parsing <parse.rst>`_ : converts the input item into standard SIF format (i.e., wraps letters and digits in ``$...$``, and turns the brackets of choice blanks and the underscores of fill-in blanks into special symbols).

* **Component segmentation**: splits an item that conforms to the SIF standard by element type, serving the subsequent tokenization step (each element type can then be tokenized with its own method).
* `Component segmentation <seg.rst>`_ : splits an item that conforms to the SIF standard by element type, serving the subsequent tokenization step (each element type can then be tokenized with its own method).

* **Tokenization**: tokenizes the segmented list of item elements, serving the subsequent vectorization module.
* `Tokenization <tokenize.rst>`_: tokenizes the segmented list of item elements, serving the subsequent vectorization module.
Usually the plain text tokenization method suffices; formulas can additionally be parsed with the AST method (via the formula module);

* **Vectorization**: mainly uses the I2V class and its subclasses to vectorize the tokenized list of item elements, yielding the corresponding static vectors.
* `Vectorization <vectorization.rst>`_: mainly uses the I2V class and its subclasses to vectorize the tokenized list of item elements, yielding the corresponding static vectors.
For the vectorization module, you can use a model you trained yourself, or directly use a provided pretrained model (via get_pretrained_i2v).

* **Downstream models**: further process the resulting vectors to obtain the desired results.
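The four stages of this pipeline compose naturally. Here is a toy sketch with stand-in stage functions; the real implementations live in EduNLP's SIF, Tokenizer, and I2V modules:

```python
# toy stand-ins for the four pipeline stages
def parse(item):
    """syntax parsing: item -> standard SIF format (toy: trim whitespace)"""
    return item.strip()

def segment(sif):
    """component segmentation: split the SIF item into elements (toy)"""
    return sif.split()

def tokenize(segs):
    """tokenization: normalize each segment into a token (toy)"""
    return [s.lower() for s in segs]

def vectorize(tokens):
    """vectorization: tokens -> static vector (toy: token lengths)"""
    return [float(len(t)) for t in tokens]

def pipeline(item):
    # parse -> segment -> tokenize -> vectorize, as in the figure above
    return vectorize(tokenize(segment(parse(item))))

pipeline(" Two Words ")  # → [3.0, 5.0]
```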
4 changes: 2 additions & 2 deletions docs/source/tutorial/zh/parse.rst
@@ -18,9 +18,9 @@

1. Match English letters and digits outside formulas; only letters and digits between two Chinese characters are corrected, and all other matches are treated as formulas entered in violation of LaTeX syntax

2. Match "(  )"-style brackets (both ASCII and full-width), i.e., brackets that are empty or contain only spaces, and replace them with $\\SIFChoice$
2. Match "(  )"-style brackets (both ASCII and full-width), i.e., brackets that are empty or contain only spaces, and replace them with ``$\\SIFChoice$``

3. Match underscores, replacing runs of underscores or underscores interspersed with spaces with $\\SIFBlank$
3. Match underscores, replacing runs of underscores or underscores interspersed with spaces with ``$\\SIFBlank$``

4. Match LaTeX formulas, mainly checking their completeness and parsability, and warn when Chinese characters appear inside LaTeX

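Rules 2 and 3 above can be sketched with regular expressions. This is a hedged approximation; the patterns actually used by EduNLP's parser are more thorough:

```python
import re

def to_sif_sketch(text):
    # rule 2: empty "(  )"-style brackets, ASCII or full-width -> $\SIFChoice$
    text = re.sub(r"[((]\s*[))]", r"$\\SIFChoice$", text)
    # rule 3: runs of underscores, possibly separated by spaces -> $\SIFBlank$
    text = re.sub(r"_+(?: +_+)*", r"$\\SIFBlank$", text)
    return text

to_sif_sketch("1 + 1 = (  ), and the blank is __ __")
# → '1 + 1 = $\SIFChoice$, and the blank is $\SIFBlank$'
```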