-
Notifications
You must be signed in to change notification settings - Fork 20
[DOC] Documentation content changes #104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Changes from all commits
Commits
Show all changes
17 commits
Select commit
Hold shift + click to select a range
9000793
Create parse.rst
BAOOOOOM d7c6b9d
Merge pull request #88 from test2021413/docs
BAOOOOOM c2b4d87
Merge pull request #89 from BAOOOOOM/docs
BAOOOOOM c4e72f9
Merge pull request #90 from BAOOOOOM/docs
BAOOOOOM 450d664
Merge pull request #91 from BAOOOOOM/docs
BAOOOOOM ed9ef6e
Merge pull request #92 from BAOOOOOM/docs
BAOOOOOM b7f3ff0
Merge pull request #93 from BAOOOOOM/docs
BAOOOOOM dd3adf5
Merge pull request #94 from BAOOOOOM/docs
BAOOOOOM cfc39a6
Merge pull request #95 from BAOOOOOM/docs
BAOOOOOM de8a239
Merge pull request #97 from BAOOOOOM/docs
BAOOOOOM c926ed2
Merge pull request #98 from BAOOOOOM/docs
BAOOOOOM d3e4422
Add English version tutorial
fannazya c84f5ac
Revert "Add English version tutorial"
fannazya 66f91c1
Revert "Revert "Add English version tutorial""
fannazya 3b76b5f
update English version
fannazya 9c62f8b
Merge pull request #102 from fannazya/docs
fannazya 3d9e4ac
update English version tutorial
fannazya File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,52 @@ | ||
| Get Started | ||
| =========== | ||
| =============== | ||
|
|
||
| * `Standard Item Format <sif.rst>`_ | ||
|
|
||
| * `Syntax Parsing <parse.rst>`_ | ||
|
|
||
| * `Component Segmentation <seg.rst>`_ | ||
|
|
||
| * `Tokenization <tokenize.rst>`_ | ||
|
|
||
| * `Pre-training <pretrain.rst>`_ | ||
|
|
||
| * `Vectorization <vectorization.rst>`_ | ||
|
|
||
| Main process | ||
| --------------- | ||
|
|
||
| .. figure:: ../../_static/new_flow.png | ||
|
|
||
| * `Syntax Parsing <parse.rst>`_ : Its function is to convert the incoming item into SIF format, which means letters and numbers should be between ``$...$`` and the brackets and underlines of the choice questions should be converted to special symbols we defined in SIF) | ||
|
|
||
| * `Component Segmentation <seg.rst>`_ : Its function is to segment items in SIF format according to the types of items, so as to serve the later tokenization module.(that is, elements in different types can be tokenized using their corresponding methods)。 | ||
|
|
||
| * `Tokenization <tokenize.rst>`_: Its function is to tokenize segmented items, so as to serve the later tokenization module. | ||
| Generally, the tokenization method in the text form can be used directly. For formulas, the ast method can also be used for parsing(call the formula module). | ||
|
|
||
| * `Vectorization <vectorization.rst>`_: This part mainly calls I2V class and its subclasses. Its function is to vectorize the list of tokenized items, so as to get the corresponding static vectors. | ||
| For vectorization module, You can call your own trained model or directly call the provided pre-training model(call get_pretrained_I2V module). | ||
|
|
||
| * **Downstream Model**:Process the obtained vectors to get the desired results. | ||
|
|
||
| Examples | ||
| --------- | ||
|
|
||
| To help you quickly understand the functions of this project, this section only shows the usages of common function interface. Intermediate function modules (such as parse, formula, segment, etc.) and more subdivided interface methods are not shown. For further study, please refer to relevant documents. | ||
|
|
||
| .. nbgallery:: | ||
| :caption: This is a thumbnail gallery: | ||
| :name: tokenize_gallery | ||
| :glob: | ||
|
|
||
| Tokenization <../../build/blitz/tokenizer/tokenizer.ipynb> | ||
|
|
||
|
|
||
|
|
||
| .. nbgallery:: | ||
| :caption: This is a thumbnail gallery: | ||
| :name: vectorization_gallery | ||
| :glob: | ||
|
|
||
| Vectorization <../../build/blitz/vectorization/total_vector.ipynb> |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,291 @@ | ||
| Syntax Parsing | ||
| ================= | ||
|
|
||
| In educational resources, texts and formulas have internal implicit or explicit syntax structures. It is of great benefit for further processing to extract these structures. | ||
|
|
||
| * Text syntax structure parsing | ||
|
|
||
| * Formula syntax structure parsing | ||
|
|
||
| The purpose is as follows: | ||
|
|
||
|
|
||
| 1. Represent underlines of blanks and brackets of choices with special identifiers. And the alphabets and formulas should be wrapped with $$, so that items of different types can be cut accurately through the symbol $. | ||
| 2. Determine whether the current item is legal and report the error type. | ||
|
|
||
| Specific processing content | ||
| -------------------------------- | ||
|
|
||
| 1.Its function is to match alphabets and numbers other than formulas. Only the alphabets and numbers between two Chinese characters will be corrected, and the rest of the cases are regarded as formulas that do not conform to latex syntax. | ||
|
|
||
| 2.Match brackets like "( )" (both English format and Chinese format), that is, brackets with no content or spaces, which should be replaced with ``$\\SIFChoice$`` | ||
|
|
||
| 3.Match continuous underscores or underscores with spaces and replace them with ``$\\SIFBlank$``. | ||
|
|
||
| 4.Match latex formulas,check the completeness and analyzability of latex formulas, and report an error for illegal formula. | ||
|
|
||
| Formula syntax structure parsing | ||
| ------------------------------------- | ||
|
|
||
| This section is mainly realized by EduNLP. Formula modules, which can determine if the text has syntax errors and convert the syntax formula into the form of ast tree. In practice, this module is often used as part of an intermediate process, and the relevant parameters of this module can be automatically chosen by calling the corresponding model, so it generally does not need special attention. | ||
|
|
||
| Introduction of Main Introduction | ||
| +++++++++++++++++++++++++++++++++++++++ | ||
|
|
||
| 1.Formula: determine whether the single formula passed in is in str form. If so, use the ast method for processing, otherwise an error will be reported. In addition, parameter variable_standardization is given. If this parameter is true, the variable standardization method will be used to make sure the same variable has the same variable number. | ||
|
|
||
| 2.FormulaGroup: If you need to pass in a formula set, you can call this interface to get an ast forest. The tree structure in the forest is the same as that of Formula. | ||
|
|
||
| Formula | ||
| >>>>>>>>>>>> | ||
|
|
||
| Formula: firstly, in the word segmentation function, the formula of the original text is segmented. In addition, ``Formula parse tree`` function is provided, which can represent the abstract syntax analysis tree of mathematical formula in the form of text or picture. | ||
|
|
||
| This module also provides the function of formula variable standardization, such as determining whether 'x' in several sub formulas is the same variable. | ||
|
|
||
| Import modules | ||
| +++++++++++++++++++++ | ||
|
|
||
| :: | ||
|
|
||
| import matplotlib.pyplot as plt | ||
| from EduNLP.Formula import Formula | ||
| from EduNLP.Formula.viz import ForestPlotter | ||
|
|
||
| Initialization | ||
| +++++++++++++++ | ||
|
|
||
| Incoming parameters: item | ||
|
|
||
| Item is the latex formula or the abstract syntax parse tree generated after the formula is parsed and its type is str or List[Dict]. | ||
|
|
||
| :: | ||
|
|
||
| >>> f=Formula("x^2 + x+1 = y") | ||
| >>> f | ||
| <Formula: x^2 + x+1 = y> | ||
|
|
||
| View the specific content after formula segmentation | ||
| ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | ||
|
|
||
| - View node elements after formula segmentation | ||
|
|
||
| :: | ||
|
|
||
| >>> f.elements | ||
| [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None}, | ||
| {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'}, | ||
| {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'}, | ||
| {'id': 3, 'type': 'bin', 'text': '+', 'role': None}, | ||
| {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None}, | ||
| {'id': 5, 'type': 'bin', 'text': '+', 'role': None}, | ||
| {'id': 6, 'type': 'textord', 'text': '1', 'role': None}, | ||
| {'id': 7, 'type': 'rel', 'text': '=', 'role': None}, | ||
| {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None}] | ||
|
|
||
| - View the abstract parse tree of formulas | ||
|
|
||
| :: | ||
|
|
||
| >>> f.ast | ||
| [{'val': {'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None}, | ||
| 'structure': {'bro': [None, 3],'child': [1, 2],'father': None,'forest': None}}, | ||
| {'val': {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'}, | ||
| 'structure': {'bro': [None, 2], 'child': None, 'father': 0, 'forest': None}}, | ||
| {'val': {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'}, | ||
| 'structure': {'bro': [1, None], 'child': None, 'father': 0, 'forest': None}}, | ||
| {'val': {'id': 3, 'type': 'bin', 'text': '+', 'role': None}, | ||
| 'structure': {'bro': [0, 4], 'child': None, 'father': None, 'forest': None}}, | ||
| {'val': {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None}, | ||
| 'structure': {'bro': [3, 5], 'child': None, 'father': None, 'forest': None}}, | ||
| {'val': {'id': 5, 'type': 'bin', 'text': '+', 'role': None}, | ||
| 'structure': {'bro': [4, 6], 'child': None, 'father': None, 'forest': None}}, | ||
| {'val': {'id': 6, 'type': 'textord', 'text': '1', 'role': None}, | ||
| 'structure': {'bro': [5, 7], 'child': None, 'father': None, 'forest': None}}, | ||
| {'val': {'id': 7, 'type': 'rel', 'text': '=', 'role': None}, | ||
| 'structure': {'bro': [6, 8], 'child': None, 'father': None, 'forest': None}}, | ||
| {'val': {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None}, | ||
| 'structure': {'bro': [7, None],'child': None,'father': None,'forest': None}}] | ||
|
|
||
| >>> print('nodes: ',f.ast_graph.nodes) | ||
| nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8] | ||
| >>> print('edges: ' ,f.ast_graph.edges) | ||
| edges: [(0, 1), (0, 2)] | ||
|
|
||
| - show the abstract parse tree by a picture | ||
|
|
||
| :: | ||
|
|
||
| >>> ForestPlotter().export(f.ast_graph, root_list=[node["val"]["id"] for node in f.ast if node["structure"]["father"] is None],) | ||
| >>> plt.show() | ||
|
|
||
|
|
||
| .. figure:: ../../_static/formula.png | ||
|
|
||
|
|
||
| Variable standardization | ||
| +++++++++++++++++++++++++++++ | ||
|
|
||
| This parameter makes the same variable have the same variable number. | ||
|
|
||
| For example: the number of variable ``x`` is ``0`` and the number of variable ``y`` is ``1``. | ||
|
|
||
| :: | ||
|
|
||
| >>> f.variable_standardization().elements | ||
| [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None}, | ||
| {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base', 'var': 0}, | ||
| {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'}, | ||
| {'id': 3, 'type': 'bin', 'text': '+', 'role': None}, | ||
| {'id': 4, 'type': 'mathord', 'text': 'x', 'role': None, 'var': 0}, | ||
| {'id': 5, 'type': 'bin', 'text': '+', 'role': None}, | ||
| {'id': 6, 'type': 'textord', 'text': '1', 'role': None}, | ||
| {'id': 7, 'type': 'rel', 'text': '=', 'role': None}, | ||
| {'id': 8, 'type': 'mathord', 'text': 'y', 'role': None, 'var': 1}] | ||
|
|
||
| FormulaGroup | ||
| >>>>>>>>>>>>>>> | ||
|
|
||
| Call ``FormulaGroup`` class to parse the equations. The related attributes and functions are the same as those above. | ||
|
|
||
| :: | ||
|
|
||
| import matplotlib.pyplot as plt | ||
| from EduNLP.Formula import Formula | ||
| from EduNLP.Formula import FormulaGroup | ||
| from EduNLP.Formula.viz import ForestPlotter | ||
| >>> fs = FormulaGroup(["x^2 = y", "x^3 = y^2", "x + y = \pi"]) | ||
| >>> fs | ||
| <FormulaGroup: <Formula: x^2 = y>;<Formula: x^3 = y^2>;<Formula: x + y = \pi>> | ||
| >>> fs.elements | ||
| [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None}, | ||
| {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'}, | ||
| {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'}, | ||
| {'id': 3, 'type': 'rel', 'text': '=', 'role': None}, | ||
| {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None}, | ||
| {'id': 5, 'type': 'supsub', 'text': '\\supsub', 'role': None}, | ||
| {'id': 6, 'type': 'mathord', 'text': 'x', 'role': 'base'}, | ||
| {'id': 7, 'type': 'textord', 'text': '3', 'role': 'sup'}, | ||
| {'id': 8, 'type': 'rel', 'text': '=', 'role': None}, | ||
| {'id': 9, 'type': 'supsub', 'text': '\\supsub', 'role': None}, | ||
| {'id': 10, 'type': 'mathord', 'text': 'y', 'role': 'base'}, | ||
| {'id': 11, 'type': 'textord', 'text': '2', 'role': 'sup'}, | ||
| {'id': 12, 'type': 'mathord', 'text': 'x', 'role': None}, | ||
| {'id': 13, 'type': 'bin', 'text': '+', 'role': None}, | ||
| {'id': 14, 'type': 'mathord', 'text': 'y', 'role': None}, | ||
| {'id': 15, 'type': 'rel', 'text': '=', 'role': None}, | ||
| {'id': 16, 'type': 'mathord', 'text': '\\pi', 'role': None}] | ||
| >>> fs.ast | ||
| [{'val': {'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None}, | ||
| 'structure': {'bro': [None, 3], | ||
| 'child': [1, 2], | ||
| 'father': None, | ||
| 'forest': None}}, | ||
| {'val': {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base'}, | ||
| 'structure': {'bro': [None, 2], | ||
| 'child': None, | ||
| 'father': 0, | ||
| 'forest': [6, 12]}}, | ||
| {'val': {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'}, | ||
| 'structure': {'bro': [1, None], 'child': None, 'father': 0, 'forest': None}}, | ||
| {'val': {'id': 3, 'type': 'rel', 'text': '=', 'role': None}, | ||
| 'structure': {'bro': [0, 4], 'child': None, 'father': None, 'forest': None}}, | ||
| {'val': {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None}, | ||
| 'structure': {'bro': [3, None], | ||
| 'child': None, | ||
| 'father': None, | ||
| 'forest': [10, 14]}}, | ||
| {'val': {'id': 5, 'type': 'supsub', 'text': '\\supsub', 'role': None}, | ||
| 'structure': {'bro': [None, 8], | ||
| 'child': [6, 7], | ||
| 'father': None, | ||
| 'forest': None}}, | ||
| {'val': {'id': 6, 'type': 'mathord', 'text': 'x', 'role': 'base'}, | ||
| show more (open the raw output data in a text editor) ... | ||
| >>> fs.variable_standardization()[0] | ||
| [{'id': 0, 'type': 'supsub', 'text': '\\supsub', 'role': None}, {'id': 1, 'type': 'mathord', 'text': 'x', 'role': 'base', 'var': 0}, {'id': 2, 'type': 'textord', 'text': '2', 'role': 'sup'}, {'id': 3, 'type': 'rel', 'text': '=', 'role': None}, {'id': 4, 'type': 'mathord', 'text': 'y', 'role': None, 'var': 1}] | ||
| >>> ForestPlotter().export(fs.ast_graph, root_list=[node["val"]["id"] for node in fs.ast if node["structure"]["father"] is None],) | ||
|
|
||
| .. figure:: ../../_static/formulagroup.png | ||
|
|
||
|
|
||
| Text syntax structure parsing | ||
| ------------------------------------ | ||
|
|
||
| This section is mainly realized by EduNLP.SIF.Parse module. Its main function is to extract letters and numbers in the text and convert them into standard format. | ||
|
|
||
| This module is mainly used as an *middle module* to parse the input text. In general, users do not call this module directly. | ||
|
|
||
| Introduction of main content | ||
| +++++++++++++++++++++++++++++++++++ | ||
|
|
||
| 1. Judge the type of the incoming text in the following order | ||
|
|
||
| * is_chinese: its function is to match Chinese characters[\u4e00-\u9fa5]. | ||
|
|
||
| * is_alphabet: its function is to match alphabets other than formulas. Only the alphabets between two Chinese characters will be corrected (wrapped with $$), and the rest of the cases are regarded as formulas that do not conform to latex syntax. | ||
|
|
||
| * is_number: its function is to match numbers other than formulas. Only the numbers between two Chinese characters will be corrected, and the rest of the cases are regarded as formulas that do not conform to latex syntax. | ||
|
|
||
| 2. Match latex formula | ||
|
|
||
| * If Chinese characters appear in latex, print warning only once. | ||
|
|
||
| * Use _is_formula_legal function, check the completeness and analyzability of latex formula, and report an error for formulas that do not conform to latex syntax. | ||
|
|
||
| Import modules | ||
| >>>>>>>>>>>>>>>>>>> | ||
|
|
||
| :: | ||
|
|
||
| from EduNLP.SIF.Parser import Parser | ||
|
|
||
| Input | ||
| >>>>>>> | ||
|
|
||
| Types: str | ||
|
|
||
| Content: question text | ||
|
|
||
| :: | ||
|
|
||
| >>> text1 = '生产某种零件的A工厂25名工人的日加工零件数_ _' | ||
| >>> text2 = 'X的分布列为( )' | ||
| >>> text3 = '① AB是⊙O的直径,AC是⊙O的切线,BC交⊙O于点E.AC的中点为D' | ||
| >>> text4 = '支持公式如$\\frac{y}{x}$,$\\SIFBlank$,$\\FigureID{1}$,不支持公式如$\\frac{ \\dddot y}{x}$' | ||
|
|
||
| Parsing | ||
| >>>>>>>>>>>>>>>>>>>> | ||
|
|
||
| :: | ||
|
|
||
| >>> text_parser1 = Parser(text1) | ||
| >>> text_parser2 = Parser(text2) | ||
| >>> text_parser3 = Parser(text3) | ||
| >>> text_parser4 = Parser(text4) | ||
|
|
||
| Related parameters description | ||
| >>>>>>>>>>>> | ||
|
|
||
| - Try to convert text to standard format | ||
|
|
||
| :: | ||
|
|
||
| >>> text_parser1.description_list() | ||
| >>> print('text_parser1.text:',text_parser1.text) | ||
| text_parser1.text: 生产某种零件的$A$工厂$25$名工人的日加工零件数$\SIFBlank$ | ||
| >>> text_parser2.description_list() | ||
| >>> print('text_parser2.text:',text_parser2.text) | ||
| text_parser2.text: $X$的分布列为$\SIFChoice$ | ||
|
|
||
| - Determine if the text has syntax errors | ||
|
|
||
| :: | ||
|
|
||
| >>> text_parser3.description_list() | ||
| >>> print('text_parser3.error_flag: ',text_parser3.error_flag) | ||
| text_parser3.error_flag: 1 | ||
| >>> text_parser4.description_list() | ||
| >>> print('text_parser4.fomula_illegal_flag: ',text_parser4.fomula_illegal_flag) | ||
| text_parser4.fomula_illegal_flag: 1 | ||
|
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.