Multi Word Expressions (v0.3) #32

apmoore1 · 2022-05-04T12:19:01Z

Added

Roadmap added.
Defined the MWE template and its syntax; this is described in Notes -> Multi Word Expression Syntax in the Usage section of the documentation. This is the first task of issue #24.
PEP 561 (Distributing and Packaging Type Information) compatible by adding py.typed file.
Added srsly as a pip requirement. We use srsly to serialise components to bytes; for example, the pymusas.lexicon_collection.LexiconCollection.to_bytes function uses srsly to serialise the LexiconCollection to bytes.
An abstract class, pymusas.base.Serialise, that requires sub-classes to create two methods to_bytes and from_bytes so that the class can be serialised.
pymusas.lexicon_collection.LexiconCollection has three new methods to_bytes, from_bytes, and __eq__. This allows the collection to be serialised and to be compared to other collections.
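To illustrate the serialisation pattern described above, here is a minimal sketch of an abstract base class that requires to_bytes and from_bytes, plus a small concrete sub-class that also defines __eq__. The names Serialisable and TinyLexicon are hypothetical and not part of PyMUSAS, and the sketch uses the standard library json module in place of srsly so that it is self-contained.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List


class Serialisable(ABC):
    """Hypothetical stand-in for an abstract serialisation base class."""

    @abstractmethod
    def to_bytes(self) -> bytes:
        """Serialise the object to bytes."""

    @staticmethod
    @abstractmethod
    def from_bytes(bytes_data: bytes) -> "Serialisable":
        """Re-create the object from its serialised bytes."""


class TinyLexicon(Serialisable):
    """Hypothetical collection mapping a lexicon entry to semantic tags."""

    def __init__(self, data: Dict[str, List[str]]) -> None:
        self.data = data

    def to_bytes(self) -> bytes:
        # json is an assumption made for this sketch; the PR uses srsly.
        return json.dumps(self.data, sort_keys=True).encode("utf-8")

    @staticmethod
    def from_bytes(bytes_data: bytes) -> "TinyLexicon":
        return TinyLexicon(json.loads(bytes_data.decode("utf-8")))

    def __eq__(self, other: object) -> bool:
        # Two collections are equal when they hold the same entries.
        if not isinstance(other, TinyLexicon):
            return False
        return self.data == other.data


original = TinyLexicon({"London|noun": ["Z2"]})
restored = TinyLexicon.from_bytes(original.to_bytes())
```

Defining __eq__ is what makes a round trip easy to check: restored compares equal to original even though they are different objects.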
A Lexicon Collection class for Multi Word Expressions (MWE), pymusas.lexicon_collection.MWELexiconCollection, which allows a user to easily create an MWE lexicon and / or load one from a TSV file, like the MWE lexicons from the Multilingual USAS repository. In addition, it can match an MWE template to templates stored in the MWELexiconCollection class following the MWE special syntax rules; this is all done through the mwe_match method. It also supports Part Of Speech mapping so that you can map from the lexicon's POS tagset to the tagset of your choice, with both one-to-one and one-to-many mappings. Like pymusas.lexicon_collection.LexiconCollection, it contains to_bytes, from_bytes, and __eq__ methods for serialisation and comparison.
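As a rough illustration of wildcard template matching, the sketch below converts an MWE template into a regular expression. It assumes that a template is a whitespace separated sequence of word_POS units and that * stands for zero or more characters within a word or POS tag; the Multi Word Expression Syntax notes define the actual rules. The function template_to_regex is hypothetical, is not part of PyMUSAS, and does not claim to show how mwe_match is implemented.

```python
import re
from typing import Pattern


def template_to_regex(template: str) -> Pattern[str]:
    """Compile a wildcard MWE template into a regular expression.

    Assumption for this sketch: `*` matches zero or more characters that
    are neither whitespace nor an underscore, so a wildcard can never run
    across a word/POS boundary or into the next token.
    """
    escaped = re.escape(template)
    return re.compile(escaped.replace(r"\*", r"[^\s_]*"))


pattern = template_to_regex("ski_noun *_noun")
matches_boot = pattern.fullmatch("ski_noun boot_noun") is not None
matches_verb = pattern.fullmatch("ski_noun jump_verb") is not None
```

Escaping the template first means that any other regular expression metacharacters in a lexicon entry are treated literally, and only the wildcard is given special meaning.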
The rule based taggers have now been componentised so that they are based on a List of Rules and a Ranker, whereby each Rule defines how one or more tokens in a text can be matched to a semantic category. Given the matches from the Rules for each token (a token can have zero or more matches), the Ranker ranks each match and finds the global best match for each token in the text. The taggers now support direct match and wildcard Multi Word Expressions. Due to this:
- pymusas.taggers.rule_based.USASRuleBasedTagger has been changed and re-named to pymusas.taggers.rule_based.RuleBasedTagger and now only has a __call__ method.
- pymusas.spacy_api.taggers.rule_based.USASRuleBasedTagger has been changed and re-named to pymusas.spacy_api.taggers.rule_based.RuleBasedTagger.
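The Rules plus Ranker design can be sketched in a few lines. Everything below is hypothetical and independent of the PyMUSAS API: the Match class, the two rule factories, and the rank and tag functions are invented for illustration, as are the example lexicon entries and the Z99 fallback tag. The ranking heuristic (longest match first, then earliest start, applied greedily) is a simplification chosen for this sketch and is not the ranking used by the tagger.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Match:
    start: int              # index of the first matched token
    end: int                # index after the last matched token
    tags: Tuple[str, ...]   # semantic tags of the matched lexicon entry


Rule = Callable[[List[str]], List[Match]]


def single_word_rule(lexicon: Dict[str, Tuple[str, ...]]) -> Rule:
    def rule(tokens: List[str]) -> List[Match]:
        return [Match(i, i + 1, lexicon[token])
                for i, token in enumerate(tokens) if token in lexicon]
    return rule


def mwe_rule(lexicon: Dict[Tuple[str, ...], Tuple[str, ...]]) -> Rule:
    def rule(tokens: List[str]) -> List[Match]:
        matches = []
        for entry, tags in lexicon.items():
            size = len(entry)
            for i in range(len(tokens) - size + 1):
                if tuple(tokens[i:i + size]) == entry:
                    matches.append(Match(i, i + size, tags))
        return matches
    return rule


def rank(tokens: List[str], matches: List[Match]) -> List[Optional[Match]]:
    best: List[Optional[Match]] = [None] * len(tokens)
    # Simplified heuristic: longer matches win, ties go to the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m.end - m.start), m.start))
    for match in ordered:
        span = range(match.start, match.end)
        if all(best[i] is None for i in span):
            for i in span:
                best[i] = match
    return best


def tag(tokens: List[str], rules: List[Rule],
        default: Tuple[str, ...] = ("Z99",)) -> List[Tuple[str, ...]]:
    # Every rule proposes matches; the ranker picks one (or none) per token.
    matches = [match for rule in rules for match in rule(tokens)]
    return [match.tags if match else default for match in rank(tokens, matches)]


tokens = ["I", "love", "New", "York"]
rules = [
    single_word_rule({"love": ("E2+",), "New": ("T3-",), "York": ("Z1",)}),
    mwe_rule({("New", "York"): ("Z2",)}),
]
result = tag(tokens, rules)
```

Because the rules only propose matches and the ranker alone decides between them, a new kind of rule can be added without touching the ranking logic.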
A Rule system, of which all rules can be found in pymusas.taggers.rules:
- pymusas.taggers.rules.rule.Rule an abstract class that describes how other sub-classes define the __call__ method and its signature. This abstract class is sub-classed from pymusas.base.Serialise.
- pymusas.taggers.rules.single_word.SingleWordRule a concrete sub-class of Rule for finding Single word lexicon entry matches.
- pymusas.taggers.rules.mwe.MWERule a concrete sub-class of Rule for finding Multi Word Expression entry matches.
A Ranking system, of which all of the components that are linked to ranking can be found in pymusas.rankers:
- pymusas.rankers.ranking_meta_data.RankingMetaData describes a lexicon entry match; these matches are typically generated when pymusas.taggers.rules.rule.Rule classes are called. They indicate that some part of a text, one or more tokens, matches a lexicon entry, whether that is a Multi Word Expression or a single word lexicon entry.
- pymusas.rankers.lexicon_entry.LexiconEntryRanker an abstract class that describes how other sub-classes should rank each token in the text and the expected output through the class's __call__ method. This abstract class is sub-classed from pymusas.base.Serialise.
- pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker a concrete sub-class of LexiconEntryRanker based off the ranking rules from Piao et al. 2003.
- pymusas.rankers.lexical_match.LexicalMatch describes the lexical match within a pymusas.rankers.ranking_meta_data.RankingMetaData object.
pymusas.utils.unique_pos_tags_in_lexicon_entry, a function that, given a lexicon entry, either Multi Word Expression or single word, returns a Set[str] of unique POS tags in the lexicon entry.
pymusas.utils.token_pos_tags_in_lexicon_entry, a function that, given a lexicon entry, either Multi Word Expression or single word, yields a Tuple[str, str] of word and POS tag from the lexicon entry.
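The behaviour of these two helpers can be approximated with the stand-alone sketch below. It does not call PyMUSAS: iter_token_pos and unique_pos are hypothetical names, and the sketch only handles MWE entries under the assumption that each unit is written as word_POS and units are separated by whitespace.

```python
from typing import Iterator, Set, Tuple


def iter_token_pos(mwe_entry: str) -> Iterator[Tuple[str, str]]:
    """Yield (word, POS tag) pairs from an entry like 'East_noun London_noun'."""
    for unit in mwe_entry.split():
        # Split on the last underscore so that words containing
        # underscores keep them.
        word, separator, pos = unit.rpartition("_")
        if not separator:
            raise ValueError(f"Expected word_POS but found: {unit!r}")
        yield word, pos


def unique_pos(mwe_entry: str) -> Set[str]:
    """Return the set of unique POS tags used in the entry."""
    return {pos for _, pos in iter_token_pos(mwe_entry)}


pairs = list(iter_token_pos("East_noun London_noun"))
pos_tags = unique_pos("ski_noun boot_noun")
```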
A mapping from USAS core to Universal Part Of Speech (UPOS) tagset.
A mapping from USAS core to basic CorCenCC POS tagset.
A mapping from USAS core to Penn Chinese Treebank POS tagset.
pymusas.lexicon_collection.LexiconMetaData, an object that contains all of the meta data about a single word or Multi Word Expression lexicon entry.
pymusas.lexicon_collection.LexiconType, which describes the different types of single word and Multi Word Expression (MWE) lexicon entries and templates that PyMUSAS uses, or will use in the case of curly braces.
The usage documentation for "How-to Tag Text" has been updated so that it includes an Indonesian example which, instead of spaCy, uses the Indonesian TreeTagger.
spaCy registered functions for reading in a LexiconCollection or MWELexiconCollection from a TSV. These can be found in pymusas.spacy_api.lexicon_collection.
spaCy registered functions for creating SingleWordRule and MWERule. These can be found in pymusas.spacy_api.taggers.rules.
spaCy registered function for creating ContextualRuleBasedRanker. This can be found in pymusas.spacy_api.rankers.
spaCy registered function for creating a List of Rules, this can be found here: pymusas.spacy_api.taggers.rules.rule_list.
LexiconCollection and MWELexiconCollection now open the TSV file downloaded through the from_tsv method using utf-8 encoding by default.
pymusas_rule_based_tagger is now a spaCy registered factory by using an entry point.
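The reason for fixing the encoding is that open() otherwise falls back to the operating system's default, which can differ between platforms. A minimal sketch of reading a lexicon TSV with an explicit encoding is shown below; the function read_lexicon_tsv, the column names, and the example row are assumptions made for illustration, not the from_tsv implementation.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def read_lexicon_tsv(path: Path) -> List[Dict[str, str]]:
    """Read a tab separated lexicon file, always decoding it as utf-8."""
    # Passing encoding explicitly avoids depending on the OS default.
    with path.open("r", encoding="utf-8", newline="") as tsv_file:
        return list(csv.DictReader(tsv_file, delimiter="\t"))


with tempfile.TemporaryDirectory() as temp_dir:
    tsv_path = Path(temp_dir) / "lexicon.tsv"
    # Hypothetical lexicon row containing a non-ASCII lemma.
    tsv_path.write_text("lemma\tpos\tsemantic_tags\ntŷ\tnoun\tH1\n",
                        encoding="utf-8")
    rows = read_lexicon_tsv(tsv_path)
```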
MWELexiconCollection warns users that it does not support curly braces MWE template expressions.
All of the POS mappings can now be called through a spaCy registered function, all of these functions can be found in the pymusas.spacy_api.pos_mapper module.
Updated the Introduction and How-to Tag Text usage documentation with the new features that PyMUSAS now supports, e.g. MWEs. The How-to Tag Text page is also updated so that it uses the pre-configured spaCy components that have been created for each language; these spaCy components can be found and downloaded from the pymusas-models repository.

Removed

pymusas.taggers.rule_based.USASRuleBasedTagger; this is now replaced with pymusas.taggers.rule_based.RuleBasedTagger.
pymusas.spacy_api.taggers.rule_based.USASRuleBasedTagger; this is now replaced with pymusas.spacy_api.taggers.rule_based.RuleBasedTagger.
The Using PyMUSAS usage documentation page, as it requires updating.

codecov-commenter · 2022-05-04T12:24:19Z

Codecov Report

Merging #32 (5feb6ef) into main (d381180) will increase coverage by 1.42%.
The diff coverage is 99.00%.

@@            Coverage Diff             @@
##             main      #32      +/-   ##
==========================================
+ Coverage   97.62%   99.05%   +1.42%     
==========================================
  Files           8       21      +13     
  Lines         337     1057     +720     
  Branches       66      214     +148     
==========================================
+ Hits          329     1047     +718     
+ Misses          7        0       -7     
- Partials        1       10       +9

Impacted Files	Coverage Δ
pymusas/lexicon_collection.py	`98.02% <97.49%> (-1.98%)`	⬇️
pymusas/taggers/rules/mwe.py	`98.73% <98.73%> (ø)`
pymusas/rankers/lexicon_entry.py	`99.18% <99.18%> (ø)`
pymusas/__init__.py	`100.00% <100.00%> (ø)`
pymusas/base.py	`100.00% <100.00%> (ø)`
pymusas/pos_mapper.py	`100.00% <100.00%> (ø)`
pymusas/rankers/lexical_match.py	`100.00% <100.00%> (ø)`
pymusas/rankers/ranking_meta_data.py	`100.00% <100.00%> (ø)`
pymusas/spacy_api/lexicon_collection.py	`100.00% <100.00%> (+100.00%)`	⬆️
pymusas/spacy_api/pos_mapper.py	`100.00% <100.00%> (ø)`
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dcd90e0...5feb6ef.

apmoore1 and others added 30 commits January 19, 2022 11:56

Start of the MWE syntax guide

cd1136c

Fixed broken link

738ff4a

Re-organisation of the test data files/folders

fc6fbe4

Update MWE syntax definitions and examples

9f1512d

Merge branch 'mwe' of github.com:UCREL/pymusas into mwe

03d3141

moved lexicon collection tests into a seperate folder

dc1f2e3

Moved LexiconCollection test data into its own folder

dc9675e

Moved LexiconCollection test data into its own folder

89adf29

Adds support for raw docstrings

8823d94

MWE Lexicon Collection

d48656a

First version of MWE matching with no special syntax #24

bf83c87

isort issue resolved

b765d4b

Merge branch 'main' into mwe

99ac3f7

MWE direct lookup can handle regular expression special syntax

5a8b325

MWE benchmark for MWE direct lookup

036a85e

Made the MWE direct lookup more efficient

e76c719

isort and flake8 corrected and removed an if statement

3119906

Documentation

5ead038

Corrected python examples

1e8f7f6

MWE lexicon collection can detect MWE given an MWE template

0b3b5bc

Refactored lexicon entry from collection in the tests

ee2fe06

Added Lexicon Meta Data object

526a395

Added LexiconMetaData to MWELexiconCollection

e57915f

Ranker to rank output from single and MWE rules

2051545

Test empty list parameter for rule based ranker

1f52680

missing assert statement

2024873

Better example

0824a3e

Added MWE Lexicon Rules

7f11774

Added MWE Lexicon Rules

2649042

Added semantic_tags to RankingMetaData object

2c43738

apmoore1 added 24 commits March 29, 2022 16:21

Click issue with version 8.1.0

e4b75a5

pytest issue with version 7.1.0

4b8a22c

Click issue with version 8.1.0

89d59ec

spacy registered functions for tagger rules

543b251

Click issue with version 8.1.0

fed00b2

spacy registered function for ContextualRuleBasedRanker

745b57a

isort

e48017a

CI does not fail on windows when it should, DEBUGGING

67ee480

CI does not fail on windows when it should, Fixed

8c21fc8

No longer use OS default encoding

37fb15e

Added rule_list spacy registered function

014f73d

spacy factory entry point

17f7821

spacy factory entry point

1e2d045

isort

f186803

@reader to @misc due to config file format

5042323

MWE Lexicon Collection can handle curly braces being added but will b…

6da04a9

…e ignored

Changed API loading page to the base module

0b288bb

Added spacy registered functions for pos mappers

2ab0d4b

version 0.3.0

4ff95aa

Needs to be updated before being added back into the documentation

61b8265

Added that we support MWE and have models that can be downloaded

91a7089

Updated so that it uses the pre-configured models

9b63279

Added link to MWE syntax notes

39b88ae

Added the changes to the documentation

5feb6ef

apmoore1 requested a review from perayson May 4, 2022 12:19

apmoore1 assigned perayson May 4, 2022

perayson merged commit a0f748b into main May 4, 2022

perayson deleted the mwe branch May 4, 2022 13:47

apmoore1 mentioned this pull request May 4, 2022

Multi Word Expressions #24

Open

4 tasks
