diff --git a/docs/source/_static/new_flow.png b/docs/source/_static/new_flow.png
new file mode 100644
index 00000000..f103cc7d
Binary files /dev/null and b/docs/source/_static/new_flow.png differ
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 27e9ed8e..83dcb4fd 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -143,6 +143,11 @@ If this repository is helpful for you, please cite our work
    tutorial/en/index
    tutorial/en/sif
+   tutorial/en/parse
+   tutorial/en/seg
+   tutorial/en/tokenize
+   tutorial/en/pretrain
+   tutorial/en/vectorization
 
 .. toctree::
    :maxdepth: 1
diff --git a/docs/source/tutorial/en/index.rst b/docs/source/tutorial/en/index.rst
index 108a9487..e75e48f1 100644
--- a/docs/source/tutorial/en/index.rst
+++ b/docs/source/tutorial/en/index.rst
@@ -1,2 +1,52 @@
 Get Started
-===========
+===============
+
+* `Standard Item Format `_
+
+* `Syntax Parsing `_
+
+* `Component Segmentation `_
+
+* `Tokenization `_
+
+* `Pre-training `_
+
+* `Vectorization `_
+
+Main process
+---------------
+
+.. figure:: ../../_static/new_flow.png
+
+* `Syntax Parsing `_ : converts the incoming item into SIF format, i.e. letters and numbers are wrapped in ``$...$``, and the brackets and underlines of multiple-choice questions are converted into the special symbols defined in SIF.
+
+* `Component Segmentation `_ : segments SIF items according to the types of their elements, so as to serve the later tokenization module (that is, elements of different types can be tokenized with their corresponding methods).
+
+* `Tokenization `_ : tokenizes the segmented items, so as to serve the later vectorization module.
+  Generally, the linear method can be used directly for text. For formulas, the ast method can also be used for parsing (it calls the formula module).
+
+* `Vectorization `_ : this part mainly calls the I2V class and its subclasses. It vectorizes the list of tokenized items to get the corresponding static vectors.
+  For the vectorization module, you can call your own trained model or directly call a provided pre-trained model (via the get_pretrained_i2v module).
+
+* **Downstream Model**: processes the obtained vectors to get the desired results.
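+
+The whole flow can be strung together in a few lines. The following is a minimal sketch, not a fixed recipe: it assumes network access to download the public pre-trained model ``d2v_sci_256`` (see `Pre-training `_), and the example item is illustrative only.
+
+::
+
+    # Sketch: fetch a public model by name, then turn items into vectors.
+    from EduNLP.I2V import get_pretrained_i2v
+
+    i2v = get_pretrained_i2v("d2v_sci_256", model_dir="./model")
+    item_vectors, token_vectors = i2v(["已知集合$A=\\{-4,1,3,5\\}$,则$A$中的最大元素为$\\SIFBlank$"])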
+
+Examples
+---------
+
+To help you quickly understand the functions of this project, this section only shows the usage of the common function interfaces. Intermediate function modules (such as parse, formula, segment, etc.) and more fine-grained interface methods are not shown here; for further study, please refer to the relevant documents.
+
+.. nbgallery::
+    :caption: This is a thumbnail gallery:
+    :name: tokenize_gallery
+    :glob:
+
+    Tokenization <../../build/blitz/tokenizer/tokenizer.ipynb>
+
+
+
+.. nbgallery::
+    :caption: This is a thumbnail gallery:
+    :name: vectorization_gallery
+    :glob:
+
+    Vectorization <../../build/blitz/vectorization/total_vector.ipynb>
diff --git a/docs/source/tutorial/en/parse.rst b/docs/source/tutorial/en/parse.rst
new file mode 100644
index 00000000..5aba283d
--- /dev/null
+++ b/docs/source/tutorial/en/parse.rst
@@ -0,0 +1,291 @@
+Syntax Parsing
+=================
+
+In educational resources, texts and formulas have internal implicit or explicit syntax structures, and extracting these structures benefits further processing. There are two kinds:
+
+* Text syntax structure parsing
+
+* Formula syntax structure parsing
+
+The purposes are as follows:
+
+1. Represent the underlines of blanks and the brackets of choices with special identifiers, and wrap letters and formulas in ``$...$``, so that items of different types can be cut accurately by the symbol ``$``.
+
+2. Determine whether the current item is legal and report the error type.
+
+Specific processing content
+--------------------------------
+
+1. Match letters and numbers outside formulas. Only the letters and numbers between two Chinese characters are corrected; all other cases are regarded as formulas that do not conform to latex syntax.
+
+2. Match brackets like "( )" (both English and Chinese formats), i.e. brackets containing nothing or only spaces, and replace them with ``$\\SIFChoice$``.
+
+3. Match continuous underscores, or underscores separated by spaces, and replace them with ``$\\SIFBlank$``.
+
+4. Match latex formulas, check their completeness and analyzability, and report an error for illegal formulas.
+
+Formula syntax structure parsing
+-------------------------------------
+
+This section is mainly realized by the EduNLP.Formula module, which can determine whether a formula has syntax errors and convert it into the form of an ast tree. In practice, this module is often used as part of an intermediate process: its relevant parameters are chosen automatically by the corresponding model, so it generally does not need special attention.
+
+Introduction of Main Content
++++++++++++++++++++++++++++++++++++++++
+
+1. Formula: determines whether the single formula passed in is in str form. If so, the ast method is used for processing; otherwise an error is reported. In addition, the parameter variable_standardization is provided; if it is true, the variable standardization method is used so that the same variable gets the same variable number.
+
+2. FormulaGroup: if you need to pass in a set of formulas, you can call this interface to get an ast forest. The tree structure in the forest is the same as that of Formula.
+
+Formula
+>>>>>>>>>>>>
+
+Formula first segments the formula of the original text in the word segmentation function. In addition, a ``Formula parse tree`` function is provided, which can represent the abstract syntax tree of a mathematical formula in the form of text or a picture.
+
+This module also provides formula variable standardization, e.g. determining whether the 'x' in several sub-formulas is the same variable.
+
+Import modules
++++++++++++++++++++++
+
+::
+
+    import matplotlib.pyplot as plt
+    from EduNLP.Formula import Formula
+    from EduNLP.Formula.viz import ForestPlotter
+
+Initialization
++++++++++++++++
+
+Incoming parameter: item
+
+The item is a latex formula, or the abstract syntax parse tree generated from an already-parsed formula; its type is str or List[Dict].
+
+::
+
+    >>> f = Formula("x^2 + x+1 = y")
+    >>> f
+    <Formula: x^2 + x+1 = y>
+
+View the specific content after formula segmentation
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+- View node elements after formula segmentation
+
+::
+
+    >>> f.elements
+    [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    {'id': 3, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None},
+    {'id': 5, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 6, 'type': 'textord', 'text': '1', 'role': None},
+    {'id': 7, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None}]
+
+- View the abstract parse tree of the formula
+
+::
+
+    >>> f.ast
+    [{'val': {'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    'structure': {'bro': [None, 3],'child': [1, 2],'father': None,'forest': None}},
+    {'val': {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    'structure': {'bro': [None, 2], 'child': None, 'father': 0, 'forest': None}},
+    {'val': {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    'structure': {'bro': [1, None], 'child': None, 'father': 0, 'forest': None}},
+    {'val': {'id': 3, 'type': 'bin', 'text': '+', 'role': None},
+    'structure': {'bro': [0, 4], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None},
+    'structure': {'bro': [3, 5], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 5, 'type': 'bin', 'text': '+', 'role': None},
+    'structure': {'bro': [4, 6], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 6, 'type': 'textord', 'text': '1', 'role': None},
+    'structure': {'bro': [5, 7], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 7, 'type': 'rel', 'text': '=', 'role': None},
+    'structure': {'bro': [6, 8], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None},
+    'structure': {'bro': [7, None],'child': None,'father': None,'forest': None}}]
+
+    >>> print('nodes: ', f.ast_graph.nodes)
+    nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8]
+    >>> print('edges: ', f.ast_graph.edges)
+    edges: [(0, 1), (0, 2)]
+
+- Show the abstract parse tree as a picture
+
+::
+
+    >>> ForestPlotter().export(f.ast_graph, root_list=[node["val"]["id"] for node in f.ast if node["structure"]["father"] is None],)
+    >>> plt.show()
+
+
+.. figure:: ../../_static/formula.png
+
+
+Variable standardization
++++++++++++++++++++++++++++++
+
+With this option, the same variable gets the same variable number.
+
+For example: the number of variable ``x`` is ``0`` and the number of variable ``y`` is ``1``.
+
+::
+
+    >>> f.variable_standardization().elements
+    [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base', 'var': 0},
+    {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    {'id': 3, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None, 'var': 0},
+    {'id': 5, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 6, 'type': 'textord', 'text': '1', 'role': None},
+    {'id': 7, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None, 'var': 1}]
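+
+If your version also exposes ``variable_standardization`` as a constructor parameter, as the description above suggests, the same numbering can be requested at construction time. A sketch under that assumption:
+
+::
+
+    # Assumption: the constructor accepts variable_standardization directly.
+    >>> f2 = Formula("x^2 + x+1 = y", variable_standardization=True)
+    >>> f2.elements[1]
+    {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base', 'var': 0}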
+
+FormulaGroup
+>>>>>>>>>>>>>>>
+
+Call the ``FormulaGroup`` class to parse a set of formulas. The related attributes and functions are the same as those above.
+
+::
+
+    import matplotlib.pyplot as plt
+    from EduNLP.Formula import Formula
+    from EduNLP.Formula import FormulaGroup
+    from EduNLP.Formula.viz import ForestPlotter
+    >>> fs = FormulaGroup(["x^2 = y", "x^3 = y^2", "x + y = \pi"])
+    >>> fs
+    <FormulaGroup: <Formula: x^2 = y>;<Formula: x^3 = y^2>;<Formula: x + y = \pi>>
+    >>> fs.elements
+    [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    {'id': 3, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None},
+    {'id': 5, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 6, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    {'id': 7, 'type': 'textord', 'text': '3', 'role': 'sup'},
+    {'id': 8, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 9, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 10, 'type': 'mathord', 'text': 'y', 'role': 'base'},
+    {'id': 11, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    {'id': 12, 'type': 'mathord', 'text': 'x', 'role': None},
+    {'id': 13, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 14, 'type': 'mathord', 'text': 'y', 'role': None},
+    {'id': 15, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 16, 'type': 'mathord', 'text': '\\pi', 'role': None}]
+    >>> fs.ast
+    [{'val': {'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    'structure': {'bro': [None, 3],
+    'child': [1, 2],
+    'father': None,
+    'forest': None}},
+    {'val': {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    'structure': {'bro': [None, 2],
+    'child': None,
+    'father': 0,
+    'forest': [6, 12]}},
+    {'val': {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    'structure': {'bro': [1, None], 'child': None, 'father': 0, 'forest': None}},
+    {'val': {'id': 3, 'type': 'rel', 'text': '=', 'role': None},
+    'structure': {'bro': [0, 4], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None},
+    'structure': {'bro': [3, None],
+    'child': None,
+    'father': None,
+    'forest': [10, 14]}},
+    {'val': {'id': 5, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    'structure': {'bro': [None, 8],
+    'child': [6, 7],
+    'father': None,
+    'forest': None}},
+    {'val': {'id': 6, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    ...]
+    >>> fs.variable_standardization()[0]
+    [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None}, {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base', 'var': 0}, {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'}, {'id': 3, 'type': 'rel', 'text': '=', 'role': None}, {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None, 'var': 1}]
+    >>> ForestPlotter().export(fs.ast_graph, root_list=[node["val"]["id"] for node in fs.ast if node["structure"]["father"] is None],)
+
+.. figure:: ../../_static/formulagroup.png
+
+
+Text syntax structure parsing
+------------------------------------
+
+This section is mainly realized by the EduNLP.SIF.Parse module. Its main function is to extract letters and numbers from the text and convert them into standard format.
+
+This module is mainly used as an *intermediate module* to parse the input text. In general, users do not call this module directly.
+
+Introduction of Main Content
++++++++++++++++++++++++++++++++++++
+
+1. Judge the type of the incoming text, in the following order:
+
+* is_chinese: matches Chinese characters [\u4e00-\u9fa5].
+
+* is_alphabet: matches letters outside formulas. Only the letters between two Chinese characters are corrected (wrapped with $$); all other cases are regarded as formulas that do not conform to latex syntax.
+
+* is_number: matches numbers outside formulas. Only the numbers between two Chinese characters are corrected; all other cases are regarded as formulas that do not conform to latex syntax.
+
+2. Match latex formulas
+
+* If Chinese characters appear in latex, print a warning only once.
+
+* Use the _is_formula_legal function to check the completeness and analyzability of latex formulas, and report an error for formulas that do not conform to latex syntax.
+
+Import modules
+>>>>>>>>>>>>>>>>>>>
+
+::
+
+    from EduNLP.SIF.Parser import Parser
+
+Input
+>>>>>>>
+
+Type: str
+
+Content: question text
+
+::
+
+    >>> text1 = '生产某种零件的A工厂25名工人的日加工零件数_ _'
+    >>> text2 = 'X的分布列为( )'
+    >>> text3 = '① AB是⊙O的直径,AC是⊙O的切线,BC交⊙O于点E.AC的中点为D'
+    >>> text4 = '支持公式如$\\frac{y}{x}$,$\\SIFBlank$,$\\FigureID{1}$,不支持公式如$\\frac{ \\dddot y}{x}$'
+
+Parsing
+>>>>>>>>>>>>>>>>>>>>
+
+::
+
+    >>> text_parser1 = Parser(text1)
+    >>> text_parser2 = Parser(text2)
+    >>> text_parser3 = Parser(text3)
+    >>> text_parser4 = Parser(text4)
+
+Related parameters description
+>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
+
+- Try to convert the text to standard format
+
+::
+
+    >>> text_parser1.description_list()
+    >>> print('text_parser1.text:', text_parser1.text)
+    text_parser1.text: 生产某种零件的$A$工厂$25$名工人的日加工零件数$\SIFBlank$
+    >>> text_parser2.description_list()
+    >>> print('text_parser2.text:', text_parser2.text)
+    text_parser2.text: $X$的分布列为$\SIFChoice$
+
+- Determine whether the text has syntax errors
+
+::
+
+    >>> text_parser3.description_list()
+    >>> print('text_parser3.error_flag: ', text_parser3.error_flag)
+    text_parser3.error_flag: 1
+    >>> text_parser4.description_list()
+    >>> print('text_parser4.fomula_illegal_flag: ', text_parser4.fomula_illegal_flag)
+    text_parser4.fomula_illegal_flag: 1
+
diff --git a/docs/source/tutorial/en/parse/FormulaSyntaxStructureParsing.rst b/docs/source/tutorial/en/parse/FormulaSyntaxStructureParsing.rst
new file mode 100644
index 00000000..c09da64b
--- /dev/null
+++ b/docs/source/tutorial/en/parse/FormulaSyntaxStructureParsing.rst
@@ -0,0 +1,168 @@
+Formula syntax structure parsing
+----------------------------------
+
+This section is mainly realized by the EduNLP.Formula module, which can determine whether a formula has syntax errors and convert it into the form of an ast tree. In practice, this module is often used as part of an intermediate process: its relevant parameters are chosen automatically by the corresponding model, so it generally does not need special attention.
+
+Introduction of Main Content
++++++++++++++++++++++++++++++++++++++
+
+1. Formula: determines whether the single formula passed in is in str form. If so, the ast method is used for processing; otherwise an error is reported. In addition, the parameter variable_standardization is provided; if it is true, the variable standardization method is used so that the same variable gets the same variable number.
+
+2. FormulaGroup: if you need to pass in a set of formulas, you can call this interface to get an ast forest. The tree structure in the forest is the same as that of Formula.
+
+Formula
+>>>>>>>>>>>>
+
+Formula first segments the formula of the original text in the word segmentation function.
+In addition, a ``Formula parse tree`` function is provided, which can represent the abstract syntax tree of a mathematical formula in the form of text or a picture.
+
+This module also provides formula variable standardization, e.g. determining whether the 'x' in several sub-formulas is the same variable.
+
+Initialization
+++++++++++++++++++++
+
+Incoming parameter: item
+
+The item is a latex formula, or the abstract syntax parse tree generated from an already-parsed formula; its type is str or List[Dict].
+
+::
+
+    >>> f = Formula("x^2 + x+1 = y")
+    >>> f
+    <Formula: x^2 + x+1 = y>
+
+View the specific content after formula segmentation
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+- View node elements after formula segmentation
+
+::
+
+    >>> f.elements
+    [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    {'id': 3, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None},
+    {'id': 5, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 6, 'type': 'textord', 'text': '1', 'role': None},
+    {'id': 7, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None}]
+
+- View the abstract parse tree of the formula
+
+::
+
+    >>> f.ast
+    [{'val': {'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    'structure': {'bro': [None, 3],'child': [1, 2],'father': None,'forest': None}},
+    {'val': {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    'structure': {'bro': [None, 2], 'child': None, 'father': 0, 'forest': None}},
+    {'val': {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    'structure': {'bro': [1, None], 'child': None, 'father': 0, 'forest': None}},
+    {'val': {'id': 3, 'type': 'bin', 'text': '+', 'role': None},
+    'structure': {'bro': [0, 4], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None},
+    'structure': {'bro': [3, 5], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 5, 'type': 'bin', 'text': '+', 'role': None},
+    'structure': {'bro': [4, 6], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 6, 'type': 'textord', 'text': '1', 'role': None},
+    'structure': {'bro': [5, 7], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 7, 'type': 'rel', 'text': '=', 'role': None},
+    'structure': {'bro': [6, 8], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None},
+    'structure': {'bro': [7, None],'child': None,'father': None,'forest': None}}]
+
+    >>> print('nodes: ', f.ast_graph.nodes)
+    nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8]
+    >>> print('edges: ', f.ast_graph.edges)
+    edges: [(0, 1), (0, 2)]
+
+- Show the abstract parse tree as a picture
+
+::
+
+    >>> ForestPlotter().export(f.ast_graph, root_list=[node["val"]["id"] for node in f.ast if node["structure"]["father"] is None],)
+    >>> plt.show()
+
+.. figure:: ../../../_static/formula.png
+
+Variable Standardization
++++++++++++++++++++++++++++++++++
+
+With this option, the same variable gets the same variable number.
+
+For example: the number of variable ``x`` is ``0`` and the number of variable ``y`` is ``1``.
+
+::
+
+    >>> f.variable_standardization().elements
+    [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base', 'var': 0},
+    {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    {'id': 3, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None, 'var': 0},
+    {'id': 5, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 6, 'type': 'textord', 'text': '1', 'role': None},
+    {'id': 7, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None, 'var': 1}]
+
+FormulaGroup
+>>>>>>>>>>>>>>>
+
+Call the ``FormulaGroup`` class to parse a set of formulas. The related attributes and functions are the same as those above.
+
+::
+
+    >>> fs = FormulaGroup(["x^2 = y", "x^3 = y^2", "x + y = \pi"])
+    >>> fs
+    <FormulaGroup: <Formula: x^2 = y>;<Formula: x^3 = y^2>;<Formula: x + y = \pi>>
+    >>> fs.elements
+    [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    {'id': 3, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None},
+    {'id': 5, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 6, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    {'id': 7, 'type': 'textord', 'text': '3', 'role': 'sup'},
+    {'id': 8, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 9, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    {'id': 10, 'type': 'mathord', 'text': 'y', 'role': 'base'},
+    {'id': 11, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    {'id': 12, 'type': 'mathord', 'text': 'x', 'role': None},
+    {'id': 13, 'type': 'bin', 'text': '+', 'role': None},
+    {'id': 14, 'type': 'mathord', 'text': 'y', 'role': None},
+    {'id': 15, 'type': 'rel', 'text': '=', 'role': None},
+    {'id': 16, 'type': 'mathord', 'text': '\\pi', 'role': None}]
+    >>> fs.ast
+    [{'val': {'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    'structure': {'bro': [None, 3],
+    'child': [1, 2],
+    'father': None,
+    'forest': None}},
+    {'val': {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    'structure': {'bro': [None, 2],
+    'child': None,
+    'father': 0,
+    'forest': [6, 12]}},
+    {'val': {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'},
+    'structure': {'bro': [1, None], 'child': None, 'father': 0, 'forest': None}},
+    {'val': {'id': 3, 'type': 'rel', 'text': '=', 'role': None},
+    'structure': {'bro': [0, 4], 'child': None, 'father': None, 'forest': None}},
+    {'val': {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None},
+    'structure': {'bro': [3, None],
+    'child': None,
+    'father': None,
+    'forest': [10, 14]}},
+    {'val': {'id': 5, 'type': 'supsub', 'text': '\\supsub', 'role': None},
+    'structure': {'bro': [None, 8],
+    'child': [6, 7],
+    'father': None,
+    'forest': None}},
+    {'val': {'id': 6, 'type': 'mathord', 'text': 'x', 'role': 'base'},
+    ...]
+    >>> fs.variable_standardization()[0]
+    [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None}, {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base', 'var': 0}, {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'}, {'id': 3, 'type': 'rel', 'text': '=', 'role': None}, {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None, 'var': 1}]
+    >>> ForestPlotter().export(fs.ast_graph, root_list=[node["val"]["id"] for node in fs.ast if node["structure"]["father"] is None],)
+
+.. figure:: ../../../_static/formulagroup.png
diff --git a/docs/source/tutorial/en/parse/TextSyntaxStructureParsing.rst b/docs/source/tutorial/en/parse/TextSyntaxStructureParsing.rst
new file mode 100644
index 00000000..6822c961
--- /dev/null
+++ b/docs/source/tutorial/en/parse/TextSyntaxStructureParsing.rst
@@ -0,0 +1,72 @@
+Text syntax structure parsing
+--------------------------------
+
+This section is mainly realized by the EduNLP.SIF.Parse module. Its main function is to extract letters and numbers from the text and convert them into standard format.
+
+This module is mainly used as an *intermediate module* to parse the input text. In general, users do not call this module directly.
+
+Introduction of Main Content
++++++++++++++++++++++++++++++++++++++
+
+1. Judge the type of the incoming text, in the following order:
+
+* is_chinese: matches Chinese characters [\u4e00-\u9fa5].
+
+* is_alphabet: matches letters outside formulas. Only the letters between two Chinese characters are corrected (wrapped with $$); all other cases are regarded as formulas that do not conform to latex syntax.
+
+* is_number: matches numbers outside formulas. Only the numbers between two Chinese characters are corrected; all other cases are regarded as formulas that do not conform to latex syntax.
+
+2. Match latex formulas
+
+* If Chinese characters appear in latex, print a warning only once.
+
+* Use the _is_formula_legal function to check the completeness and analyzability of latex formulas, and report an error for formulas that do not conform to latex syntax.
+
+Input
+>>>>>>>
+
+Type: str
+
+Content: question text
+
+::
+
+    >>> text1 = '生产某种零件的A工厂25名工人的日加工零件数_ _'
+    >>> text2 = 'X的分布列为( )'
+    >>> text3 = '① AB是⊙O的直径,AC是⊙O的切线,BC交⊙O于点E.AC的中点为D'
+    >>> text4 = '支持公式如$\\frac{y}{x}$,$\\SIFBlank$,$\\FigureID{1}$,不支持公式如$\\frac{ \\dddot y}{x}$'
+
+Parsing
+>>>>>>>>>>>>>>>>>>>>
+
+::
+
+    >>> text_parser1 = Parser(text1)
+    >>> text_parser2 = Parser(text2)
+    >>> text_parser3 = Parser(text3)
+    >>> text_parser4 = Parser(text4)
+
+Related parameters description
+>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
+
+- Try to convert the text to standard format
+
+::
+
+    >>> text_parser1.description_list()
+    >>> print('text_parser1.text:', text_parser1.text)
+    text_parser1.text: 生产某种零件的$A$工厂$25$名工人的日加工零件数$\SIFBlank$
+    >>> text_parser2.description_list()
+    >>> print('text_parser2.text:', text_parser2.text)
+    text_parser2.text: $X$的分布列为$\SIFChoice$
+
+- Determine whether the text has syntax errors
+
+::
+
+    >>> text_parser3.description_list()
+    >>> print('text_parser3.error_flag: ', text_parser3.error_flag)
+    text_parser3.error_flag: 1
+    >>> text_parser4.description_list()
+    >>> print('text_parser4.fomula_illegal_flag: ', text_parser4.fomula_illegal_flag)
+    text_parser4.fomula_illegal_flag: 1
diff --git a/docs/source/tutorial/en/pretrain.rst b/docs/source/tutorial/en/pretrain.rst
new file mode 100644
index 00000000..58105f44
--- /dev/null
+++ b/docs/source/tutorial/en/pretrain.rst
@@ -0,0 +1,130 @@
+Pre-training
+==============
+
+In the field of NLP, pre-trained language models have become a very important basic technology.
+In this chapter, we will introduce the pre-training tools in EduNLP:
+
+* How to train with a corpus to get a pre-trained model
+* How to load a pre-trained model
+* Public pre-trained models
+
+Import modules
+---------------
+
+::
+
+    from EduNLP.I2V import get_pretrained_i2v
+    from EduNLP.Vector import get_pretrained_t2v
+
+Train the Model
+------------------
+
+Call the train_vector function interface directly to make training easier. This section calls the relevant training models in the gensim library. At present, the training methods "sg", "cbow", "fasttext", "d2v", "bow" and "tfidf" are provided. The parameter embedding_dim is also provided so that users can choose the vector dimension according to their needs.
+
+Basic Steps
+##################
+
+1. Determine the type of model and select the appropriate tokenizer (GensimWordTokenizer, GensimSegTokenizer) to finish tokenization.
+
+2. Call the train_vector function to get the required pre-trained model.
+
+Examples:
+
+::
+
+    >>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
+    >>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
+    ... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
+    >>> print(token_item.tokens[:10])
+    ['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
+
+    # train a 10-dimensional model with the d2v method
+    train_vector(sif_items, "../../../data/w2v/gensim_luna_stem_tf_", 10, method="d2v")
+
+
+Load models
+----------------
+
+Pass the obtained model to the I2V module to load it.
+
+Examples:
+
+::
+
+    >>> model_path = "../test_model/test_gensim_luna_stem_tf_d2v_256.bin"
+    >>> i2v = D2V("text", "d2v", filepath=model_path, pretrained_t2v=False)
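+
+The loaded ``i2v`` instance can then be used to vectorize items. A minimal sketch (the item text is illustrative, and the returned pair follows the I2V convention of item vectors plus token vectors):
+
+::
+
+    # Sketch: vectorize one item with the model loaded above.
+    >>> item_vector, token_vector = i2v(["有公式$\\FormFigureID{1}$,如图$\\FigureID{088f15ea-xxx}$"])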
+
+Overview of our public models
+------------------------------------
+
+Version description
+#######################
+
+First level version:
+
+* Public version 1 (luna_pub): college entrance examination
+* Public version 2 (luna_pub_large): college entrance examination + regional examinations
+
+Second level version:
+
+* Single subject (Chinese, Math, English, History, Geography, Politics, Biology, Physics, Chemistry)
+* Multiple subjects (science, arts and all subjects)
+
+Third level version (to be finished):
+
+* Without third-party initializers
+* With third-party initializers
+
+Description of the training data in the models
+##############################################
+
+* Currently, the data used in the w2v and d2v models covers senior high school subjects.
+* test data: `[OpenLUNA.json] `_
+
+At present, the following models are provided. More models of different subjects and question types are being trained; stay tuned.
+    "d2v_all_256" (all subjects), "d2v_sci_256" (science), "d2v_eng_256" (English), "d2v_lit_256" (arts)
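+
+A public model can be fetched and used by name. A minimal sketch, assuming network access for the download (see the loading section above for using a locally trained model):
+
+::
+
+    >>> from EduNLP.I2V import get_pretrained_i2v
+    >>> i2v = get_pretrained_i2v("d2v_sci_256", model_dir="./model")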
+
+Examples of Model Training
+------------------------------------
+
+Get the dataset
+####################
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   prepare_dataset <../../build/blitz/pretrain/prepare_dataset.ipynb>
+
+An example of d2v in gensim model
+##################################
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   d2v_bow_tfidf <../../build/blitz/pretrain/gensim/d2v_bow_tfidf.ipynb>
+   d2v_general <../../build/blitz/pretrain/gensim/d2v_general.ipynb>
+   d2v_stem_tf <../../build/blitz/pretrain/gensim/d2v_stem_tf.ipynb>
+
+An example of w2v in gensim model
+##################################
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   w2v_stem_text <../../build/blitz/pretrain/gensim/w2v_stem_text.ipynb>
+   w2v_stem_tf <../../build/blitz/pretrain/gensim/w2v_stem_tf.ipynb>
+
+An example of seg_token
+#############################
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   d2v.ipynb <../../build/blitz/pretrain/seg_token/d2v.ipynb>
+   d2v_d1 <../../build/blitz/pretrain/seg_token/d2v_d1.ipynb>
+   d2v_d2 <../../build/blitz/pretrain/seg_token/d2v_d2.ipynb>
\ No newline at end of file
diff --git a/docs/source/tutorial/en/pretrain/loading.rst b/docs/source/tutorial/en/pretrain/loading.rst
new file mode 100644
index 00000000..83b54c39
--- /dev/null
+++ b/docs/source/tutorial/en/pretrain/loading.rst
@@ -0,0 +1,11 @@
+Load models
+----------------
+
+Pass the obtained model to the I2V module to load it.
+
+Examples:
+
+::
+
+    >>> model_path = "../test_model/test_gensim_luna_stem_tf_d2v_256.bin"
+    >>> i2v = D2V("text", "d2v", filepath=model_path, pretrained_t2v=False)
diff --git a/docs/source/tutorial/en/pretrain/pub.rst b/docs/source/tutorial/en/pretrain/pub.rst
new file mode 100644
index 00000000..60077309
--- /dev/null
+++ b/docs/source/tutorial/en/pretrain/pub.rst
@@ -0,0 +1,74 @@
+Overview of our public models
+------------------------------------
+
+
+Version Description
+#########################
+
+First level version:
+
+* Public version 1 (luna_pub): college entrance examination
+* Public version 2 (luna_pub_large): college entrance examination + regional examinations
+
+Second level version:
+
+* Single subject (Chinese, Math, English, History, Geography, Politics, Biology, Physics, Chemistry)
+* Multiple subjects (science, arts and all subjects)
+
+Third level version (to be finished):
+
+* Without third-party initializers
+* With third-party initializers
+
+Description of the training data in the models
+##############################################
+
+* Currently, the data used in the w2v and d2v models covers senior high school subjects.
+* test data: `[OpenLUNA.json] `_
+
+At present, the following models are provided. More models of different subjects and question types are being trained; stay tuned.
+    "d2v_all_256" (all subjects), "d2v_sci_256" (science), "d2v_eng_256" (English), "d2v_lit_256" (arts)
+
+Examples of model training
+----------------------------
+
+Get the dataset
+####################
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   prepare_dataset <../../../build/blitz/pretrain/prepare_dataset.ipynb>
+
+An example of d2v in gensim model
+####################################
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   d2v_bow_tfidf <../../../build/blitz/pretrain/gensim/d2v_bow_tfidf.ipynb>
+   d2v_general <../../../build/blitz/pretrain/gensim/d2v_general.ipynb>
+   d2v_stem_tf <../../../build/blitz/pretrain/gensim/d2v_stem_tf.ipynb>
+
+An example of w2v in gensim model
+####################################
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   w2v_stem_text <../../../build/blitz/pretrain/gensim/w2v_stem_text.ipynb>
+   w2v_stem_tf <../../../build/blitz/pretrain/gensim/w2v_stem_tf.ipynb>
+
+An example of seg_token
+############################
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   d2v.ipynb <../../../build/blitz/pretrain/seg_token/d2v.ipynb>
+   d2v_d1 <../../../build/blitz/pretrain/seg_token/d2v_d1.ipynb>
+   d2v_d2 <../../../build/blitz/pretrain/seg_token/d2v_d2.ipynb>
diff --git a/docs/source/tutorial/en/pretrain/start.rst b/docs/source/tutorial/en/pretrain/start.rst
new file mode 100644
index 00000000..4aa91619
--- /dev/null
+++ b/docs/source/tutorial/en/pretrain/start.rst
@@ -0,0 +1,24 @@
+Train the model
+------------------
+
+Call the train_vector function interface directly to make training easier. This section calls the relevant training models in the gensim library. At present, the training methods "sg", "cbow", "fasttext", "d2v", "bow" and "tfidf" are provided. The parameter embedding_dim is also provided so that users can choose the vector dimension according to their needs.
+
+Basic Steps
+##################
+
+1. Determine the type of model and select the appropriate tokenizer (GensimWordTokenizer, GensimSegTokenizer) to finish tokenization.
+
+2. Call the train_vector function to get the required pre-trained model.
+
+Examples:
+
+::
+
+    >>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
+    >>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
+    ... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
+    >>> print(token_item.tokens[:10])
+    ['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
+
+    # train a 10-dimensional model with the d2v method
+    train_vector(sif_items, "../../../data/w2v/gensim_luna_stem_tf_", 10, method="d2v")
diff --git a/docs/source/tutorial/en/seg.rst b/docs/source/tutorial/en/seg.rst
new file mode 100644
index 00000000..4e2f2d39
--- /dev/null
+++ b/docs/source/tutorial/en/seg.rst
@@ -0,0 +1,187 @@
+Component Segmentation
+=========================
+
+Educational resources are a kind of multimodal data, including text, pictures, formulas and so on.
+At the same time, they may also contain different semantic components, such as question stems, options, etc. Therefore, we first need to identify and segment the different components of an educational resource:
+
+* Semantic Component Segmentation
+* Structural Component Segmentation
+
+Main Processing Contents
+---------------------------
+
+1. Convert multiple-choice questions in the form of dict to qualified items via `Syntax parsing `_;
+
+2. Segment the input items and group the segments by element type.
+
+Semantic Component Segmentation
+---------------------------------
+
+Because multiple-choice questions are given in the form of dict, it is necessary to convert them into text format while retaining their data relationship. This is realized by the dict2str4sif function, which converts multiple-choice question items into character format and identifies the question stem and options.
+
+Import Modules
++++++++++++++++++++++++
+
+::
+
+    from EduNLP.utils import dict2str4sif
+
+Basic Usage
+++++++++++++++++++
+
+::
+
+    >>> item = {
+    ... "stem": r"若复数$z=1+2 i+i^{3}$,则$|z|=$",
+    ... "options": ['0', '1', r'$\sqrt{2}$', '2'],
+    ... }
+    >>> dict2str4sif(item) # doctest: +ELLIPSIS
+    '$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$$\\SIFTag{list_3}$2$\\SIFTag{options_end}$'
+
+Optional additional parameters / interfaces
+++++++++++++++++++++++++++++++++++++++++++++++++++
+
+1. add_list_no_tag: if true, each option is labeled with its own numbered tag ($\\SIFTag{list_0}$, $\\SIFTag{list_1}$, ...); if false, the options are joined with $\\SIFSep$ instead.
+
+::
+
+    >>> dict2str4sif(item, add_list_no_tag=True) # doctest: +ELLIPSIS
+    '$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$$\\SIFTag{list_3}$2$\\SIFTag{options_end}$'
+
+    >>> dict2str4sif(item, add_list_no_tag=False) # doctest: +ELLIPSIS
+    '$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$0$\\SIFSep$1$\\SIFSep$$\\sqrt{2}$$\\SIFSep$2$\\SIFTag{options_end}$'
+
+2. tag_mode: selects where the labels are placed. 'delimiter' labels both the beginning and the end, 'head' labels only the head, and 'tail' labels only the tail.
+
+::
+
+    >>> dict2str4sif(item, tag_mode="head") # doctest: +ELLIPSIS
+    '$\\SIFTag{stem}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{options}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$$\\SIFTag{list_3}$2'
+
+    >>> dict2str4sif(item, tag_mode="tail") # doctest: +ELLIPSIS
+    '若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$$\\SIFTag{list_3}$2$\\SIFTag{options}$'
+
+3. key_as_tag: if false, the process only adds $\SIFSep$ between the options, without distinguishing the type of segmentation label.
+
+::
+
+    >>> dict2str4sif(item, key_as_tag=False)
+    '若复数$z=1+2 i+i^{3}$,则$|z|=$0$\\SIFSep$1$\\SIFSep$$\\sqrt{2}$$\\SIFSep$2'
+
+Structural Component Segmentation
+------------------------------------------
+
+This step segments the converted items. A depth option is available: you can select all positions, or only certain labels such as \SIFSep and \SIFTag, for segmentation according to your needs. You can also select where to add labels: at both the head and the tail, or only at the head or the tail.
+
+
+There are two modes:
+
+* linear mode: used for text processing (word segmentation using the jieba library);
+
+* ast mode: used to parse formulas.
+
+The basic segmentation process is:
+
+- Match components with regular expressions
+
+- Process the components with special structures, such as converting base64-encoded pictures to numpy form
+
+- Classify the elements into their element groups
+
+- Enter the corresponding parameters as required to get the filtered results
+
+Import Modules
++++++++++++++++++
+
+::
+
+    from EduNLP.SIF.segment import seg
+    from EduNLP.SIF import sif4sci
+
+Basic Usage
+++++++++++++++++++
+
+::
+
+    >>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
+    >>> s = seg(test_item)
+    >>> s
+    ['如图所示,则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
+
+Optional additional parameters/interfaces
++++++++++++++++++++++++++++++++++++++++++++++
+
+1. describe: count the number of elements of each type
+
+::
+
+    >>> s.describe()
+    {'t': 3, 'f': 1, 'g': 1, 'm': 1}
+
+2. filter: this interface can screen out one or more types of elements.
+
+You can pass in a "keep" parameter, or a special character directly, to choose which types of elements to retain.
+
+Element types represented by the symbols:
+
+- "t": text
+- "f": formula
+- "g": figure
+- "m": question mark
+- "a": tag
+- "s": sep tag
+
+::
+
+    >>> with s.filter("f"):
+    ...     s
+    ['如图所示,则', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
+    >>> with s.filter(keep="t"):
+    ...     s
+    ['如图所示,则', '的面积是', '。']
+
+3. symbol: this interface can convert some types of elements into special symbols
+
+Element types represented by the symbols:
+
+- "t": text
+- "f": formula
+- "g": figure
+- "m": question mark
+
+::
+
+    >>> seg(test_item, symbol="fgm")
+    ['如图所示,则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
+    >>> seg(test_item, symbol="tfgm")
+    ['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
+
+In addition, the sif4sci function is provided, which can easily convert items into the result of Structural Component Segmentation:
+
+::
+
+    >>> segments = sif4sci(item["stem"], figures=figures, tokenization=False)
+    >>> segments
+    ['如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形', 'ABC', '的斜边', 'BC', ', 直角边', 'AB', ', ', 'AC', '.', '\\bigtriangleup ABC', '的三边所围成的区域记为', 'I', ',黑色部分记为', 'II', ', 其余部分记为', 'III', '.在整个图形中随机取一点,此点取自', 'I,II,III', '的概率分别记为', 'p_1,p_2,p_3', ',则', '\\SIFChoice', \FigureID{1}]
+
+- When calling this function, you can selectively output a certain type of data according to your needs
+
+::
+
+    >>> segments.formula_segments
+    ['ABC',
+    'BC',
+    'AB',
+    'AC',
+    '\\bigtriangleup ABC',
+    'I',
+    'II',
+    'III',
+    'I,II,III',
+    'p_1,p_2,p_3']
+
+- Similar to the seg function, sif4sci also provides depth options: by modifying the ``symbol`` parameter, different components can be transformed into specific markers.
+
+::
+
+    >>> sif4sci(item["stem"], figures=figures, tokenization=False, symbol="tfgm")
+    ['[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[FIGURE]']
diff --git a/docs/source/tutorial/en/seg/SemanticComponentSegmentation.rst b/docs/source/tutorial/en/seg/SemanticComponentSegmentation.rst
new file mode 100644
index 00000000..c6535941
--- /dev/null
+++ b/docs/source/tutorial/en/seg/SemanticComponentSegmentation.rst
@@ -0,0 +1,47 @@
+Semantic Component Segmentation
+------------------------------------
+
+Because multiple-choice questions are given in the form of dict, it is necessary to convert them into text format while retaining their data relationship. This is realized by the dict2str4sif function, which converts multiple-choice question items into character format and identifies the question stem and options.
+
+
+Basic Usage
+++++++++++++++++++
+
+::
+
+    >>> item = {
+    ... "stem": r"若复数$z=1+2 i+i^{3}$,则$|z|=$",
+    ... "options": ['0', '1', r'$\sqrt{2}$', '2'],
+    ... }
+    >>> dict2str4sif(item) # doctest: +ELLIPSIS
+    '$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$$\\SIFTag{list_3}$2$\\SIFTag{options_end}$'
+
+Optional additional parameters / interfaces
+++++++++++++++++++++++++++++++++++++++++++++++++
+
+1. add_list_no_tag: if true, each option is labeled with its own numbered tag ($\\SIFTag{list_0}$, $\\SIFTag{list_1}$, ...); if false, the options are joined with $\\SIFSep$ instead.
+ +:: + + >>> dict2str4sif(item, add_list_no_tag=True) # doctest: +ELLIPSIS + '$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$$\\SIFTag{list_3}$2$\\SIFTag{options_end}$' + + >>> dict2str4sif(item, add_list_no_tag=False) # doctest: +ELLIPSIS + '$\\SIFTag{stem_begin}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$0$\\SIFSep$1$\\SIFSep$$\\sqrt{2}$$\\SIFSep$2$\\SIFTag{options_end}$' + +2.tag_mode: The location for the label can be selected using this parameter. 'delimiter' is to label both the beginning and the end,'head' is to label only the head, and 'tail' is to label only the tail. + +:: + + >>> dict2str4sif(item, tag_mode="head") # doctest: +ELLIPSIS + '$\\SIFTag{stem}$若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{options}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$$\\SIFTag{list_3}$2' + + >>> dict2str4sif(item, tag_mode="tail") # doctest: +ELLIPSIS + '若复数$z=1+2 i+i^{3}$,则$|z|=$$\\SIFTag{stem}$$\\SIFTag{list_0}$0$\\SIFTag{list_1}$1$\\SIFTag{list_2}$$\\sqrt{2}$$\\SIFTag{list_3}$2$\\SIFTag{options}$' + +3.key_as_tag: If this parameter is false, this process will only adds $\SIFSep$ between the options without distinguishing the type of segmentation label. + +:: + + >>> dict2str4sif(item, key_as_tag=False) + '若复数$z=1+2 i+i^{3}$,则$|z|=$0$\\SIFSep$1$\\SIFSep$$\\sqrt{2}$$\\SIFSep$2' \ No newline at end of file diff --git a/docs/source/tutorial/en/seg/StructuralComponentSegmentation.rst b/docs/source/tutorial/en/seg/StructuralComponentSegmentation.rst new file mode 100644 index 00000000..f5c44f7e --- /dev/null +++ b/docs/source/tutorial/en/seg/StructuralComponentSegmentation.rst @@ -0,0 +1,67 @@ +Structural Component Segmentation +------------------------------------ + +This step is to segment sliced items. In this step, there is a depth option. You can select all positions or some labels for segmentation according to your needs, such as \SIFSep and \SIFTag. You can also select where to add labels, either at the head and tail or only at the head or tail. + + +There are two modes: + +* linear mode: it is used for text processing (word segmentation using jieba library); + +* ast mode: it is used to parse the formula. + +Basic Usage +++++++++++++++++++ + +:: + + >>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$" + >>> seg(test_item) + >>> ['如图所示,则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}] + +Optional additional parameters/interfaces ++++++++++++++++++++++++++++++++++++++++++++++ + +1.describe: count the number of elements of different types + +:: + + >>> s.describe() + {'t': 3, 'f': 1, 'g': 1, 'm': 1} + +2.filter: this interface can screen out one or more types of elements. + +Using this interface, you can pass in a "keep" parameter or a special character directly to choose what type of elements to retain. + +Element type represented by symbol: + "t": text + "f": formula + "g": figure + "m": question mark + "a": tag + "s": sep tag + +:: + + >>> with s.filter("f"): + ... s + ['如图所示,则', '的面积是', '\\SIFBlank', '。', \FigureID{1}] + >>> with s.filter(keep="t"): + ... 
+    ['如图所示,则', '的面积是', '。']
+
+3. symbol: this interface can convert some types of elements into special symbols
+
+Element types represented by the symbols:
+
+- "t": text
+- "f": formula
+- "g": figure
+- "m": question mark
+
+::
+
+    >>> seg(test_item, symbol="fgm")
+    ['如图所示,则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
+    >>> seg(test_item, symbol="tfgm")
+    ['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
diff --git a/docs/source/tutorial/en/sif.rst b/docs/source/tutorial/en/sif.rst
index 0cbe7cf3..7650aae6 100644
--- a/docs/source/tutorial/en/sif.rst
+++ b/docs/source/tutorial/en/sif.rst
@@ -1,2 +1,145 @@
 Standard Item Format
-====================
+=======================
+
+version: 0.2
+
+For the convenience of follow-up research and use, we need a unified grammar standard for test questions.
+
+Grammar Rules
+----------------
+
+1. Only Chinese characters, Chinese and English punctuation, and line breaks are allowed in the question text.
+
+2. Represent the underlines of blanks and the brackets of choices with ``\$\SIFBlank\$`` and ``\$\SIFChoice\$`` respectively.
+
+3. Pictures are represented by ``$\FigureID{ uuid }$`` or Base64; in particular, ``$\FormFigureID{ uuid }$`` is used for pictures of formulas.
+
+4. Text format description: text in different styles is represented with ``$\textf{item,CHAR_EN}$``. Currently defined styles are: b-bold, i-italic, u-underline, w-wave, d-dotted, t-title. The CHAR_EN labels can be mixed, and are sorted alphabetically. For example, $\textf{EduNLP, b}$ looks like **EduNLP**.
+
+5. Other mathematical symbols, such as English letters, Roman characters and numbers, need to be expressed in latex format, that is, embedded in ``$$``.
+
+6. For the entry standard of molecular formulas, please refer to `INCHI `_ for the time being.
+
+7. Currently, there are no requirements for latex-internal syntax.
+
+::
+
+    1. Item -> CHARACTER|EN_PUN_LIST|CH_PUN_LIST|FORMULA|QUES_MARK
+    2. EN_PUN_LIST -> [',', '.', '?', '!', ':', ';', '\'', '\"', '(', ')', ' ','_','/','|','\\','<','>','[',']','-']
+    3. CH_PUN_LIST -> [',', '。', '!', '?', ':',';', '‘', '’', '“', '”', '(', ')', ' ', '、','《','》','—','.']
+    4. FORMULA -> $latex formula$ | $\FormFigureID{UUID}$ | $\FormFigureBase64{BASE64}$
+    5. FIGURE -> $\FigureID{UUID}$ | $\FigureBase64{BASE64}$
+    6. UUID -> [a-zA-Z\-0-9]+
+    7. CHARACTER -> CHAR_EN | CHAR_CH
+    8. CHAR_EN -> [a-zA-Z]+
+    9. CHAR_CH -> [\u4e00-\u9fa5]+
+    10. DIGITAL -> [0-9]+
+    11. QUES_MARK -> $\SIFBlank$ | $\SIFChoice$
+
+
+Tips
++++++++++++++++
+
+1. Reserved characters and escape characters.
+
+2. Numbers.
+
+3. Choices and blanks.
+
+4. A single number or letter is also required to be between ``$$`` (automatic verification can already handle this).
+
+5. Try to make sure no Chinese appears inside a latex formula, such as ``\text{CHAR_CH}``.
+
+6. When importing data with a MySQL database, a single ``\`` is automatically ignored, so it needs to be further processed as ``\\``.
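+
+For tip 6, the following is a minimal Python sketch of the kind of pre-processing meant (the sample string is illustrative only):
+
+::
+
+    # Double every backslash before inserting the item into MySQL.
+    raw = r"若$\frac{y}{x}$满足约束条件"
+    escaped = raw.replace("\\", "\\\\")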
+
+Examples
+-----------------
+
+Standard Format:
+
+::
+
+    1. 若$x,y$满足约束条件$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,则$z=x+7 y$的最大值$\\SIFUnderline$
+
+    2. 已知函数$f(x)=|3 x+1|-2|x|$画出$y=f(x)$的图像求不等式$f(x)>f(x+1)$的解集$\\PictureID{3bf2ddf4-8af1-11eb-b750-b46bfc50aa29}$$\\PictureID{59b8bd14-8af1-11eb-93a5-b46bfc50aa29}$$\\PictureID{63118b3a-8b75-11eb-a5c0-b46bfc50aa29}$$\\PictureID{6a006179-8b76-11eb-b386-b46bfc50aa29}$$\\PictureID{088f15eb-8b7c-11eb-a86f-b46bfc50aa29}$
+
+Non-standard Format:
+
+1. Letters, numbers and mathematical symbols are mixed:
+
+   For example:
+
+   ``完成下面的2x2列联表,``
+
+   ``(单位:m3)``
+
+   ``则输出的n=``
+
+2. Some special mathematical symbols are not represented by latex formulas:
+
+   For example:
+
+   ``命题中真命题的序号是 ①``
+
+   ``AB是⊙O的直径,AC是⊙O的切线,BC交⊙O于点E.若D为AC的中点``
+
+3. There are unicode-encoded characters in the text:
+
+   For example:
+
+   ``则$a$的取值范围是(\u3000\u3000)``
+
+Functions for judging whether text is in SIF format and converting it to SIF format
+--------------------------------------------------------------------------------------------------
+
+Import modules
+++++++++++++++++
+
+::
+
+    from EduNLP.SIF import is_sif, to_sif
+
+is_sif
++++++++++++
+
+::
+
+    >>> text1 = '若$x,y$满足约束条件'
+    >>> text2 = '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,'
+    >>> text3 = '则$z=x+7 y$的最大值$\\SIFUnderline$'
+    >>> text4 = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
+    >>> is_sif(text1)
+    True
+    >>> is_sif(text2)
+    True
+    >>> is_sif(text3)
+    True
+    >>> is_sif(text4)
+    False
+
+to_sif
++++++++++++
+
+::
+
+    >>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
+    >>> to_sif(text)
+    '某校一个课外学习小组为研究某作物的发芽率$y$和温度$x$(单位...'
+
+
+Change Log
+----------------
+
+2021-05-18
+
+Changed
+
+1. Originally, we used ``\$\SIFUnderline\$`` and ``\$\SIFBracket\$`` to represent the underlines of blanks and the brackets of choices. Now we represent them with ``\$\SIFBlank\$`` and ``\$\SIFChoice\$``.
+
+2. Originally, we used ``$\PictureID{ uuid }$`` to represent pictures; now we use ``$\FigureID{ uuid }$`` instead. In particular, ``$\FormFigureID{ uuid }$`` is used for pictures of formulas.
+
+2021-06-28
+
+Added:
+
+1. There should be no line breaks between the ``$$`` notation.
+
+2. Added the text format description.
diff --git a/docs/source/tutorial/en/tokenization/GensimSegTokenizer.rst b/docs/source/tutorial/en/tokenization/GensimSegTokenizer.rst
new file mode 100644
index 00000000..eb624e94
--- /dev/null
+++ b/docs/source/tutorial/en/tokenization/GensimSegTokenizer.rst
@@ -0,0 +1,9 @@
+GensimSegTokenizer
+=====================
+
+By default, the pictures, separators and blanks in the question text, as well as other parts of the incoming item, are converted into special characters for data security and for the tokenization of text, formulas and labels. The tokenizer uses the linear analysis method for text and the abstract syntax tree method for formulas.
+
+Compared to GensimWordTokenizer, the main differences are:
+
+* It provides a depth option for the segmentation position, such as \SIFSep and \SIFTag.
+* By default, labels are inserted at the head of item components (such as text and formula).
\ No newline at end of file
diff --git a/docs/source/tutorial/en/tokenization/GensimWordTokenizer.rst b/docs/source/tutorial/en/tokenization/GensimWordTokenizer.rst
new file mode 100644
index 00000000..98d4b10a
--- /dev/null
+++ b/docs/source/tutorial/en/tokenization/GensimWordTokenizer.rst
@@ -0,0 +1,23 @@
+GensimWordTokenizer
+=====================
+
+By default, the pictures and blanks in the question text, as well as other parts of the incoming item, are converted into special characters for data security and for the tokenization of text, formulas, labels and separators. The tokenizer uses the linear analysis method for text and the abstract syntax tree method for formulas. You can choose between them with the ``general`` parameter:
+
+- true: the incoming item conforms to SIF, and the linear analysis method is used.
+- false: the incoming item does not conform to SIF, and the abstract syntax tree method is used.
+
+Examples
+----------
+
+::
+
+    >>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
+    >>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
+    ... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
+    >>> print(token_item.tokens[:10])
+    ['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
+    >>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
+    >>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
+    ... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
+    >>> print(token_item.tokens[:10])
+    ['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
diff --git a/docs/source/tutorial/en/tokenization/PureTextTokenizer.rst b/docs/source/tutorial/en/tokenization/PureTextTokenizer.rst
new file mode 100644
index 00000000..88ec975e
--- /dev/null
+++ b/docs/source/tutorial/en/tokenization/PureTextTokenizer.rst
@@ -0,0 +1,31 @@
+PureTextTokenizer
+===================
+
+By default, the pictures, labels, separators and blanks in the question text, as well as other parts of the incoming item, are converted into special characters for data security. At the same time, special formulas such as $\\FormFigureID{...}$ and $\\FormFigureBase64{...}$ are screened out to facilitate the tokenization of text and plain-text formulas. The tokenizer uses the linear analysis method for text and formulas, and the ``key`` parameter is provided to preprocess the incoming item; it will be improved based on users' requirements in the future.
+
+Examples
+----------
+
+::
+
+    >>> tokenizer = PureTextTokenizer()
+    >>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
+    ... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
+    >>> tokens = tokenizer(items)
+    >>> next(tokens)[:10]
+    ['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
+    >>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
+    >>> tokens = tokenizer(items)
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
+    >>> items = [{
+    ... "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
+    ... "options": ["1", "2"]
+    ... }]
+    >>> tokens = tokenizer(items, key=lambda x: x["stem"])
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
diff --git a/docs/source/tutorial/en/tokenization/TextTokenizer.rst b/docs/source/tutorial/en/tokenization/TextTokenizer.rst
new file mode 100644
index 00000000..08991be6
--- /dev/null
+++ b/docs/source/tutorial/en/tokenization/TextTokenizer.rst
@@ -0,0 +1,27 @@
+TextTokenizer
+================
+
+By default, the pictures, labels, separators and blanks in the question text, as well as other parts of the incoming item, are converted into special characters for data security and for the tokenization of text and formulas. The tokenizer uses the linear analysis method for text and formulas, and the ``key`` parameter is provided to preprocess the incoming item; it will be improved based on users' requirements in the future.
+
+
+Examples
+----------
+
+::
+
+    >>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
+    >>> tokenizer = TextTokenizer()
+    >>> tokens = tokenizer(items)
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
+    >>> items = [{
+    ... "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
+    ... "options": ["1", "2"]
+    ... }]
+    >>> tokens = tokenizer(items, key=lambda x: x["stem"])
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
diff --git a/docs/source/tutorial/en/tokenize.rst b/docs/source/tutorial/en/tokenize.rst
new file mode 100644
index 00000000..f6350614
--- /dev/null
+++ b/docs/source/tutorial/en/tokenize.rst
@@ -0,0 +1,173 @@
+Tokenization
+==============
+
+Tokenization, i.e. word segmentation and sentence segmentation, is a basic but very important step in the field of NLP.
+In EduNLP, we divide tokenization into different levels according to granularity. To avoid ambiguity, we define:
+
+* Word/char level: word segmentation
+
+* Sentence level: sentence segmentation
+
+* Resource level: tokenization
+
+This module provides the tokenization function for question text, converting questions into token sequences to facilitate vectorization. Each element in the segmented item then needs word segmentation. A depth option is available: you can select all positions, or only certain labels such as \SIFSep and \SIFTag, for segmentation according to your needs; you can also select where to add labels, at both the head and the tail or only at the head or the tail.
+
+There are two modes: the linear mode, used for text processing (word segmentation using the jieba library), and the ast mode, used to parse formulas.
+
+Word Segmentation
+---------------------
+
+Text-tokenization: a sentence (without formulas) consists of several "words" in order. The process of dividing a sentence into several words is called "Text-tokenization". According to the granularity of "words", it can be subdivided into "Word-tokenization" and "Char-tokenization":
+
+::
+
+    - Word-tokenization: each phrase is a token.
+
+    - Char-tokenization: each character is a token.
+
+
+Text-tokenization has two main steps:
+
+1. Segmentation:
+
+   - Word-tokenization: use a word segmentation tool to segment and extract words from the question text. Our project supports `jieba`.
+
+   - Char-tokenization: process the text character by character.
+
+2. Filtering: filter out the specified stopwords.
+
+   The default stopwords used in this project: `[stopwords] `_
+   You can also use your own stopwords; a sketch follows the examples below.
+
+Examples:
+
+::
+
+    from EduNLP.SIF.tokenization.text import tokenize
+    >>> text = "三角函数是基本初等函数之一"
+    >>> tokenize(text, granularity="word")
+    ['三角函数', '初等', '函数']
+
+    >>> tokenize(text, granularity="char")
+    ['三', '角', '函', '数', '基', '初', '函', '数']
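+
+To use your own stopwords, pass them in explicitly. The following is a minimal sketch; the ``stopwords`` parameter name follows our reading of the signature of ``EduNLP.SIF.tokenization.text.tokenize``, so treat it as an assumption and check the signature in your installed version:
+
+::
+
+    from EduNLP.SIF.tokenization.text import tokenize
+
+    # Assumed usage: a custom stopword set replaces the default stopword list.
+    tokens = tokenize("三角函数是基本初等函数之一", granularity="word", stopwords={"之一"})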
+
+Word Segmentation
+---------------------
+
+Text-tokenization: a sentence (without formulas) consists of several "words" in order. The process of dividing a sentence into these words is called "Text-tokenization". According to the granularity of "words", it can be subdivided into "Word-tokenization" and "Char-tokenization":
+
+- Word-tokenization: each phrase is a token.
+
+- Char-tokenization: each character is a token.
+
+
+Text-tokenization consists of two main steps:
+
+1. Segmentation:
+
+   - Word-tokenization: use a word segmentation tool to segment and extract words from the question text. Our project supports `jieba`.
+
+   - Char-tokenization: process the text character by character.
+
+2. Filtering: filter out the specified stopwords.
+
+   The default stopwords used in this project: `[stopwords] `_
+   You can also use your own stopwords. The following example demonstrates how to use them.
+
+Examples:
+
+::
+
+    >>> from EduNLP.SIF.tokenization.text import tokenize
+    >>> text = "三角函数是基本初等函数之一"
+    >>> tokenize(text, granularity="word")
+    ['三角函数', '初等', '函数']
+
+    >>> tokenize(text, granularity="char")
+    ['三', '角', '函', '数', '基', '初', '函', '数']
+
+Sentence Segmentation
+----------------------------
+
+During sentence segmentation, a long document is divided into several sentences. Each sentence is a "token" (to be implemented).
+
+Tokenization
+--------------
+
+Tokenization here is the comprehensive analysis: sentences that contain formulas are segmented into several markers, and each marker is a "token".
+
+This is implemented by the ``tokenize`` function; the required result is obtained by passing in an item that has gone through component segmentation.
+
+::
+
+    >>> from EduNLP.SIF.segment import SegmentList
+    >>> from EduNLP.SIF.tokenization import tokenize
+    >>> items = "如图所示,则三角形$ABC$的面积是$\\SIFBlank$。$\\FigureID{1}$"
+    >>> tokenize(SegmentList(items))
+    ['如图所示', '三角形', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
+    >>> tokenize(SegmentList(items), formula_params={"method": "ast"})
+    ['如图所示', '三角形', <Formula: ABC>, '面积', '\\SIFBlank', \FigureID{1}]
+
+You can view ``./EduNLP/Tokenizer/tokenizer.py`` and ``./EduNLP/Pretrain/gensim_vec.py`` for more tokenizers. We provide several encapsulated tokenizers for convenient use. The following is the complete list:
+
+- TextTokenizer
+
+- PureTextTokenizer
+
+- GensimSegTokenizer
+
+- GensimWordTokenizer
+
+
+TextTokenizer
++++++++++++++++++++++
+
+By default, the pictures, labels, separators and blanks in the incoming item are converted into special tokens, which also helps with data security, and both the text and the formulas are tokenized. The tokenizer applies the linear analysis method to text and formulas, and the ``key`` parameter can be used to preprocess the incoming item (for example, to select one field of a dictionary); further options will be added according to users' requirements.
+
+::
+
+    >>> from EduNLP.Tokenizer import TextTokenizer
+    >>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
+    >>> tokenizer = TextTokenizer()
+    >>> tokens = tokenizer(items)
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
+    >>> items = [{
+    ...     "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
+    ...     "options": ["1", "2"]
+    ... }]
+    >>> tokens = tokenizer(items, key=lambda x: x["stem"])
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
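+
+The next tokenizer, ``PureTextTokenizer``, differs from ``TextTokenizer`` mainly in that it screens out formula figures. A minimal comparison sketch (per the descriptions on this page; the expected outputs are shown in each tokenizer's own examples):
+
+::
+
+    from EduNLP.Tokenizer import PureTextTokenizer, TextTokenizer
+
+    item = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$"]
+    text_tokens = next(TextTokenizer()(item))       # formula figures are kept for tokenization
+    pure_tokens = next(PureTextTokenizer()(item))   # \FormFigureID / \FormFigureBase64 screened out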
+
+PureTextTokenizer
++++++++++++++++++++++
+
+By default, the pictures, labels, separators and blanks in the incoming item are converted into special tokens, which also helps with data security. In addition, formula figures such as ``$\\FormFigureID{...}$`` and ``$\\FormFigureBase64{...}$`` are screened out, so that only the text and the plain-text formulas are tokenized. The tokenizer applies the linear analysis method to both text and formulas, and the ``key`` parameter can be used to preprocess the incoming item (for example, to select one field of a dictionary); further options will be added according to users' requirements.
+
+::
+
+    >>> from EduNLP.Tokenizer import PureTextTokenizer
+    >>> tokenizer = PureTextTokenizer()
+    >>> items = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
+    ... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
+    >>> tokens = tokenizer(items)
+    >>> next(tokens)[:10]
+    ['公式', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[SEP]', 'z']
+    >>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
+    >>> tokens = tokenizer(items)
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
+    >>> items = [{
+    ...     "stem": "已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$",
+    ...     "options": ["1", "2"]
+    ... }]
+    >>> tokens = tokenizer(items, key=lambda x: x["stem"])
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
+
+GensimWordTokenizer
++++++++++++++++++++++++
+
+By default, the pictures and blanks in the incoming item are converted into special tokens, which also helps with data security, and the text, formulas, labels and separators are all tokenized. The tokenizer applies the linear analysis method to text and the abstract syntax tree method to formulas; which method is applied to formulas is controlled by the ``general`` parameter:
+
+- ``general=True``: the incoming item conforms to SIF, so the linear analysis method is used.
+
+- ``general=False``: the incoming item does not conform to SIF, so the abstract syntax tree method is used.
+
+GensimSegTokenizer
+++++++++++++++++++++
+
+By default, the pictures, separators and blanks in the incoming item are converted into special tokens, which also helps with data security, and the text, formulas and labels are tokenized. The tokenizer applies the linear analysis method to text and the abstract syntax tree method to formulas.
+
+Compared to GensimWordTokenizer, the main differences are as follows (a usage sketch is given after the list):
+
+* It provides the depth option for the segmentation position, such as ``\SIFSep`` and ``\SIFTag``.
+* By default, labels are inserted at the head of item components (such as text and formulas).
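+
+A minimal usage sketch. The ``symbol`` and ``depth`` parameters below are assumptions based on the description above; check ``./EduNLP/Pretrain/gensim_vec.py`` for the exact signature:
+
+::
+
+    from EduNLP.Pretrain import GensimSegTokenizer
+
+    # depth is assumed to control where labels are inserted
+    # (e.g. at \SIFSep / \SIFTag positions)
+    tokenizer = GensimSegTokenizer(symbol="gmas", depth=None)
+    seg_item = tokenizer("有公式$x+y$,如图$\\FigureID{1}$")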
+
+Examples
+----------
+
+The following examples use ``GensimWordTokenizer`` with different ``symbol`` and ``general`` settings:
+
+::
+
+    >>> from EduNLP.Pretrain import GensimWordTokenizer
+    >>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
+    >>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
+    ... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
+    >>> print(token_item.tokens[:10])
+    ['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
+    >>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
+    >>> token_item = tokenizer("有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,\
+    ... 若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$")
+    >>> print(token_item.tokens[:10])
+    ['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
diff --git a/docs/source/tutorial/en/tokenize/Sentence Segmentation.rst b/docs/source/tutorial/en/tokenize/Sentence Segmentation.rst
new file mode 100644
index 00000000..902b2bb5
--- /dev/null
+++ b/docs/source/tutorial/en/tokenize/Sentence Segmentation.rst
@@ -0,0 +1,3 @@
+Sentence Segmentation
+-------------------------
+During sentence segmentation, a long document is divided into several sentences. Each sentence is a "token" (to be implemented).
diff --git a/docs/source/tutorial/en/tokenize/Tokenization.rst b/docs/source/tutorial/en/tokenize/Tokenization.rst
new file mode 100644
index 00000000..c955602b
--- /dev/null
+++ b/docs/source/tutorial/en/tokenize/Tokenization.rst
@@ -0,0 +1,29 @@
+Tokenization
+--------------
+Tokenization here is the comprehensive analysis: sentences that contain formulas are segmented into several markers, and each marker is a "token".
+We provide several encapsulated tokenizers for convenient use; the complete list is given at the end of this page.
+
+Examples:
+
+::
+
+    >>> from EduNLP.Tokenizer import TextTokenizer
+    >>> items = ["已知集合$A=\\left\\{x \\mid x^{2}-3 x-4<0\\right\\}, \\quad B=\\{-4,1,3,5\\}, \\quad$ 则 $A \\cap B=$"]
+    >>> tokenizer = TextTokenizer()
+    >>> tokens = tokenizer(items)
+    >>> next(tokens) # doctest: +NORMALIZE_WHITESPACE
+    ['已知', '集合', 'A', '=', '\\left', '\\{', 'x', '\\mid', 'x', '^', '{', '2', '}', '-', '3', 'x', '-', '4', '<',
+    '0', '\\right', '\\}', ',', '\\quad', 'B', '=', '\\{', '-', '4', ',', '1', ',', '3', ',', '5', '\\}', ',',
+    '\\quad', 'A', '\\cap', 'B', '=']
+
+You can view ``./EduNLP/Tokenizer/tokenizer.py`` and ``./EduNLP/Pretrain/gensim_vec.py`` for more tokenizers. The following is the complete list:
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
+   ../tokenization/TextTokenizer
+   ../tokenization/PureTextTokenizer
+   ../tokenization/GensimSegTokenizer
+   ../tokenization/GensimWordTokenizer
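+
+These tokenizers can also be obtained by name through ``get_tokenizer``. A minimal sketch, assuming the registered names follow ``./EduNLP/Tokenizer/tokenizer.py`` (e.g. ``"text"`` and ``"pure_text"``):
+
+::
+
+    from EduNLP.Tokenizer import get_tokenizer
+
+    items = ["如图所示,则三角形$ABC$的面积是$\\SIFBlank$。"]
+    tokenizer = get_tokenizer("text")   # the name selects the tokenizer class
+    tokens = next(tokenizer(items))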
diff --git a/docs/source/tutorial/en/tokenize/WordSegmentation.rst b/docs/source/tutorial/en/tokenize/WordSegmentation.rst
new file mode 100644
index 00000000..181f0b80
--- /dev/null
+++ b/docs/source/tutorial/en/tokenize/WordSegmentation.rst
@@ -0,0 +1,36 @@
+Word segmentation
+---------------------
+
+Text-tokenization: a sentence (without formulas) consists of several "words" in order. The process of dividing a sentence into these words is called "Text-tokenization". According to the granularity of "words", it can be subdivided into "Word-tokenization" and "Char-tokenization":
+
+- Word-tokenization: each phrase is a token.
+
+- Char-tokenization: each character is a token.
+
+
+Text-tokenization consists of two main steps:
+
+1. Segmentation:
+
+   - Word-tokenization: use a word segmentation tool to segment and extract words from the question text. Our project supports `jieba`.
+
+   - Char-tokenization: process the text character by character.
+
+2. Filtering: filter out the specified stopwords.
+
+   The default stopwords used in this project: `[stopwords] `_
+   You can also use your own stopwords. The following example demonstrates how to use them.
+
+Examples:
+
+::
+
+    >>> from EduNLP.SIF.tokenization.text import tokenize
+    >>> text = "三角函数是基本初等函数之一"
+    >>> tokenize(text, granularity="word")
+    ['三角函数', '初等', '函数']
+
+    >>> tokenize(text, granularity="char")
+    ['三', '角', '函', '数', '基', '初', '函', '数']
+
diff --git a/docs/source/tutorial/en/vectorization.rst b/docs/source/tutorial/en/vectorization.rst
new file mode 100644
index 00000000..eb59a34c
--- /dev/null
+++ b/docs/source/tutorial/en/vectorization.rst
@@ -0,0 +1,158 @@
+Vectorization
+==================
+
+This section provides a simple interface that converts incoming items into vectors directly. You can choose whether or not to use a pre-trained model: call D2V directly if you do not want to use one, or call the get_pretrained_i2v function if you do.
+
+- Don't use the pre-trained model
+
+- Use the pre-trained model
+
+Overall Flow
+---------------------------
+
+1. Perform `syntax parsing `_ on incoming items to get items in SIF format;
+
+2. Perform `component segmentation `_ on sif_items;
+
+3. Perform `tokenization `_ on segmented items;
+
+4. Use an existing model or one of the pre-trained models we provide to convert the tokenized items into vectors.
+
+
+Use the pre-trained model: call get_pretrained_i2v directly
+--------------------------------------------------------------
+
+Use a pre-trained model provided by EduNLP to convert the given question text into vectors.
+
+* Advantages: simple and convenient.
+
+* Disadvantages: only the models given in the project can be used, which is rather limiting.
+
+* Call this function to obtain the corresponding pre-trained model. At present, the following pre-trained models are provided: d2v_all_256, d2v_sci_256, d2v_eng_256 and d2v_lit_256.
+
+Selection and use of models
+####################################
+
+Select the pre-trained model according to the subject:
+
++--------------------------+--------------------------------+
+| Pre-trained model name   | Subject of the training data   |
++==========================+================================+
+| d2v_all_256              | All subjects                   |
++--------------------------+--------------------------------+
+| d2v_sci_256              | Science                        |
++--------------------------+--------------------------------+
+| d2v_lit_256              | Arts                           |
++--------------------------+--------------------------------+
+| d2v_eng_256              | English                        |
++--------------------------+--------------------------------+
+
+
+The concrete processing flow
+####################################
+
+1. Download the corresponding pre-trained model.
+
+2. Pass the obtained model to D2V and let D2V do the processing.
+
+Examples:
+
+::
+
+    >>> from EduNLP.I2V import get_pretrained_i2v
+    >>> item = ["如图所示,则三角形$ABC$的面积是$\\SIFBlank$。$\\FigureID{1}$"]
+    >>> i2v = get_pretrained_i2v("d2v_sci_256")
+    >>> i2v(item)
+
+
+Don't use the pre-trained model: call existing models directly
+--------------------------------------------------------------------------
+
+You can use any model you have trained yourself (just give the storage path of the model) to convert the given question text into vectors.
+
+* Advantages: it is flexible to use your own model, and its parameters can be adjusted freely. An end-to-end sketch is given below.
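+
+Putting the whole flow together, a minimal end-to-end sketch (the model path is a placeholder for your own trained model):
+
+::
+
+    from EduNLP.I2V import D2V
+
+    items = ["如图所示,则三角形$ABC$的面积是$\\SIFBlank$。"]
+    # D2V tokenizes internally with the "text" tokenizer and then
+    # vectorizes with the local Doc2Vec model
+    i2v = D2V("text", "d2v", filepath="path/to/your_d2v_model.bin", pretrained_t2v=False)
+    item_vectors, token_vectors = i2v(items)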
+
+Import modules
++++++++++++++++++++++++
+
+::
+
+    from EduNLP.I2V import D2V, W2V, get_pretrained_i2v
+    from EduNLP.Vector import T2V, get_pretrained_t2v
+
+Models provided
+++++++++++++++++++++
+
+- W2V
+
+- D2V
+
+- T2V
+
+W2V
+<<<<<<<<<
+
+This model directly uses the relevant methods of the gensim library to convert words into vectors. Currently, the following methods are provided:
+
+ - FastText
+
+ - Word2Vec
+
+ - KeyedVectors
+
+::
+
+    >>> i2v = get_pretrained_i2v("test_w2v", "examples/test_model/data/w2v") # doctest: +ELLIPSIS
+    >>> item_vector, token_vector = i2v(["有学者认为:‘学习’,必须适应实际"])
+    >>> item_vector # doctest: +ELLIPSIS
+    array([[...]], dtype=float32)
+
+D2V
+<<<<<<<<<<<<
+
+This model is a comprehensive processing method that converts items into vectors. Currently, the following methods are provided:
+
+- d2v: call the Doc2Vec module of the gensim library to convert items into vectors.
+
+- BowLoader: call the corpora module of the gensim library to convert documents into bag-of-words (BOW) representations.
+
+- TfidfLoader: call the TfidfModel module of the gensim library to convert bag-of-words representations into TF-IDF vectors.
+
+::
+
+    >>> item = {"如图来自古希腊数学家希波克拉底所研究的几何图形.此图由三个半圆构成,三个半圆的直径分别为直角三角形$ABC$的斜边$BC$, 直角边$AB$, $AC$.$\\bigtriangleup ABC$的三边所围成的区域记为$I$,黑色部分记为$II$, 其余部分记为$III$.在整个图形中随机取一点,此点取自$I,II,III$的概率分别记为$p_1,p_2,p_3$,则$\\SIFChoice$$\\FigureID{1}$"}
+    >>> model_path = "../test_model/test_gensim_luna_stem_tf_d2v_256.bin"
+    >>> i2v = D2V("text", "d2v", filepath=model_path, pretrained_t2v=False)
+    >>> i2v(item)
+    ([array([ 4.76559885e-02, -1.60574958e-01,  1.94614579e-03,  2.40295693e-01,
+    2.24517003e-01, -3.24351490e-02,  4.35789041e-02, -1.65670961e-02,...
+
+T2V
+<<<<<<<<<<
+
+You can use your own trained model (just give its storage path) to represent the token sequences of a group of questions as vectors.
+
+- Advantages: the model and its parameters can be adjusted independently, which gives strong flexibility.
+
+Input
+^^^^^^^^^^
+
+Type: list.
+Content: the token sequence of each question in the question group.
+You can convert question text (``str``) into tokens with the ``GensimWordTokenizer`` model.
+
+::
+
+    >>> token_items = ['公式','[FORMULA]','公式','[FORMULA]','如图','[FIGURE]','x',',','y','约束条件','[SEP]','z','=','x','+','7','y','最大值','[MARK]']
+    >>> path = "../test_model/test_gensim_luna_stem_tf_d2v_256.bin"
+    >>> t2v = T2V('d2v', filepath=path)
+    >>> t2v(token_items)
+    [array([ 0.0256574 ,  0.06061139, -0.00121044, -0.0167674 , -0.0111706 ,
+    0.05325712, -0.02097339, -0.01613594,  0.02904145,  0.0185046 ,...
+
+The specific processing flow
+++++++++++++++++++++++++++++++++++++++++
+
+1. Call the get_tokenizer function to obtain the tokenized result;
+
+2. Select the provided model type according to the model used, and vectorize the result.
+
diff --git a/docs/source/tutorial/en/vectorization/WithPre-trainedModel.rst b/docs/source/tutorial/en/vectorization/WithPre-trainedModel.rst
new file mode 100644
index 00000000..844fdd3b
--- /dev/null
+++ b/docs/source/tutorial/en/vectorization/WithPre-trainedModel.rst
@@ -0,0 +1,42 @@
+Use the pre-trained model: call get_pretrained_i2v directly
+--------------------------------------------------------------------
+
+Use a pre-trained model provided by EduNLP to convert the given question text into vectors.
+
+* Advantages: simple and convenient.
+
+* Disadvantages: only the models given in the project can be used, which is rather limiting.
+
+* Call this function to obtain the corresponding pre-trained model. At present, the following pre-trained models are provided: d2v_all_256, d2v_sci_256, d2v_eng_256 and d2v_lit_256.
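+
+For example, choosing a model by subject might look like the following minimal sketch (the subject-to-model mapping is illustrative; see the table below):
+
+::
+
+    from EduNLP.I2V import get_pretrained_i2v
+
+    # illustrative mapping from subject to pre-trained model name
+    subject2model = {"science": "d2v_sci_256", "arts": "d2v_lit_256", "english": "d2v_eng_256"}
+    i2v = get_pretrained_i2v(subject2model["science"])  # downloaded on first use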
+
+Selection and use of models
+####################################
+
+Select the pre-trained model according to the subject:
+
++--------------------------+--------------------------------+
+| Pre-trained model name   | Subject of the training data   |
++==========================+================================+
+| d2v_all_256              | All subjects                   |
++--------------------------+--------------------------------+
+| d2v_sci_256              | Science                        |
++--------------------------+--------------------------------+
+| d2v_lit_256              | Arts                           |
++--------------------------+--------------------------------+
+| d2v_eng_256              | English                        |
++--------------------------+--------------------------------+
+
+The concrete processing flow
+####################################
+
+1. Download the corresponding pre-trained model.
+
+2. Pass the obtained model to D2V and let D2V do the processing.
+
+Examples:
+
+::
+
+    >>> from EduNLP.I2V import get_pretrained_i2v
+    >>> item = ["如图所示,则三角形$ABC$的面积是$\\SIFBlank$。$\\FigureID{1}$"]
+    >>> i2v = get_pretrained_i2v("d2v_sci_256")
+    >>> i2v(item)
diff --git a/docs/source/tutorial/en/vectorization/WithoutPre-trainedModel.rst b/docs/source/tutorial/en/vectorization/WithoutPre-trainedModel.rst
new file mode 100644
index 00000000..62ce6155
--- /dev/null
+++ b/docs/source/tutorial/en/vectorization/WithoutPre-trainedModel.rst
@@ -0,0 +1,21 @@
+Don't use the pre-trained model: call existing models directly
+----------------------------------------------------------------
+
+You can use any model you have trained yourself (just give the storage path of the model) to convert the given question text into vectors.
+
+* Advantages: it is flexible to use your own model, and its parameters can be adjusted freely.
+
+Specific processing flow
++++++++++++++++++++++++++++++++++++
+
+1. Call the get_tokenizer function to obtain the tokenized result (sketched below);
+
+2. Select the provided model type according to the model used, and vectorize the result.
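+
+A minimal sketch of step 1, assuming the ``"text"`` tokenizer name (the Examples below then cover step 2):
+
+::
+
+    from EduNLP.Tokenizer import get_tokenizer
+
+    items = ["如图所示,则三角形$ABC$的面积是$\\SIFBlank$。"]
+    tokenizer = get_tokenizer("text")
+    tokens = list(tokenizer(items))   # one token list per item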
+ +Examples: + +:: + + >>> model_path = "../test_model/test_gensim_luna_stem_tf_d2v_256.bin" + >>> i2v = D2V("text","d2v",filepath=model_path, pretrained_t2v = False) + >>> i2v(item) diff --git a/docs/source/tutorial/zh/vectorization.rst b/docs/source/tutorial/zh/vectorization.rst index 8c57cac7..aff364ff 100644 --- a/docs/source/tutorial/zh/vectorization.rst +++ b/docs/source/tutorial/zh/vectorization.rst @@ -19,6 +19,49 @@ 4.使用已有或者使用提供的预训练模型,将令牌化后的item转换为向量。 +使用预训练模型:直接调用get_pretrained_i2v +--------------------------------------------- + +使用 EduNLP 项目组给定的预训练模型将给定的题目文本转成向量。 + +* 优点:简单方便。 + +* 缺点:只能使用项目中给定的模型,局限性较大。 + +* 调用此函数即可获得相应的预训练模型,目前提供以下的预训练模型:d2v_all_256、d2v_sci_256、d2v_eng_256、d2v_lit_256 + +模型选择与使用 +################## + +根据题目所属学科选择预训练模型: + ++--------------------+------------------------+ +| 预训练模型名称 | 模型训练数据的所属学科 | ++====================+========================+ +| d2v_all_256 | 全学科 | ++--------------------+------------------------+ +| d2v_sci_256 | 理科 | ++--------------------+------------------------+ +| d2v_lit_256 | 文科 | ++--------------------+------------------------+ +| d2v_eng_256 | 英语 | ++--------------------+------------------------+ + +处理的具体流程 +################## + +1.下载相应的预处理模型 + +2.将所得到的模型传入D2V,使用D2V进行处理 + +Examples: + +:: + + >>> i2v = get_pretrained_i2v("d2v_sci_256") + >>> i2v(item) + + 不使用预训练模型:直接调用已有模型 ------------------------------------ @@ -110,46 +153,3 @@ T2V 1.调用get_tokenizer函数,得到经过分词后的结果; 2.根据使用的模型,选择提供的模型类型,进行向量化处理。 - - -使用预训练模型:直接调用get_pretrained_i2v ---------------------------------------------- - -使用 EduNLP 项目组给定的预训练模型将给定的题目文本转成向量。 - -* 优点:简单方便。 - -* 缺点:只能使用项目中给定的模型,局限性较大。 - -* 调用此函数即可获得相应的预训练模型,目前提供以下的预训练模型:d2v_all_256、d2v_sci_256、d2v_eng_256、d2v_lit_256 - -模型选择与使用 -################## - -根据题目所属学科选择预训练模型: - -+--------------------+------------------------+ -| 预训练模型名称 | 模型训练数据的所属学科 | -+====================+========================+ -| d2v_all_256 | 全学科 | -+--------------------+------------------------+ -| d2v_sci_256 | 理科 | -+--------------------+------------------------+ -| d2v_lit_256 | 文科 | -+--------------------+------------------------+ -| d2v_eng_256 | 英语 | -+--------------------+------------------------+ - -处理的具体流程 -################## - -1.下载相应的预处理模型 - -2.将所得到的模型传入D2V,使用D2V进行处理 - -Examples: - -:: - - >>> i2v = get_pretrained_i2v("d2v_sci_256") - >>> i2v(item)