New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework release scripts for v1.4 #16
Conversation
At least for the codepoints in html.entities.codepoint2name and the custom ones, there are no names that are valid 4-digit hexadecimal values, so we don't need something like 0x prefixing the codepoint.
The latest changes address 3 of 4 of the tasks listed above. There are some new ones, though: |
I think the only remaining issues are the two above and the availability of |
Wn 0.8.1 had a packaging issue so I released Wn 0.8.2, however it looks like I might not have caught some LMF export issues, so I'm investigating before releasing 0.8.3. Regarding WordNet 3.0, my validation script found some other issues caused by issues in the source data (shown below, abbreviated):
I can fix (1) by simply dropping the redundant relations. (2) may require a manual fix, as we did for inhibit. |
Or maybe we just drop that one, too. That synset already has derivation links to unicyclist and unicycle (v). |
But I would be very happy to drop this. Maybe check both for duplicate
relations (and discard) and self loops (and discard)?
I think there is also a loop through categories with computer and computer
science, each being in the domain of the other --- but I am not sure this
is really a bug.
In OMW 2.0 I only checked for self-loops and I only cycles with hypernyms
(+ instances).
I would expect loops with derivations (although the self loop seems silly).
…On Tue, Nov 2, 2021 at 12:56 PM Michael Wayne Goodman < ***@***.***> wrote:
(2) may require a manual fix, as we did for *inhibit*.
Or maybe we just drop that one, too. That synset already has derivation
links to *unicyclist* and *unicycle* (v).
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#16 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIPZRSHNPOYM3CRVS2FVJLUJ5VRBANCNFSM5G3OFFZA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
|
Ok sounds good. Also, WordNet 3.1 has the same problems, plus it also has about 40 synsets with the same ILI as another synset. E.g.:
|
Wow!
I have not looked at PWN 3.1 but this looks like a bug in the mapping.
@jmccrae do you know anything about this?
I guess we have four options for this release:
(i) randomly pick one
(ii) not link either (and not somewhere)
(iii) manually fix them
(iv) hope John has a solution
If (iv) does not happen then I would prefer to go with (ii).
…On Tue, Nov 2, 2021 at 1:04 PM Michael Wayne Goodman < ***@***.***> wrote:
Ok sounds good. Also, WordNet 3.1 has the same problems, plus it also has
about 40 synsets with the same ILI as another synset. E.g.:
Synset ID ILI
omw-en31-00767587-n i39442
omw-en31-00767761-n i39442
omw-en31-00780744-n i39502
omw-en31-00781071-n i39502
... ...
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#16 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIPZRSGIYODBXQZXCG6QGDUJ5WO5ANCNFSM5G3OFFZA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
|
It seems that this is an error in the alignment script that was used to generate the WN 3.1 mapping. It seems that some of the synsets that were new to PWN 3.1 have in some cases been assigned a key in that file that is incorrect. Instead there should be no assigned key. I will fix this in the CILI repo |
Actually, if you look at Turtle version of the mapping you can see what the solution is:
In all cases, one is marked as a close match and one as an exact match |
I think the solution is to remove the mappings in the |
Returning to the above issue, I just compared the NLTK's wordnet files and the one's we're using, and they are currently identical (after we fixed the inhibit loop). That is, the NLTK also has the unicycle loop and redundant edges: >>> from nltk.corpus import wordnet as wn
>>> wn.lemmas('unicycle', pos='n')
[Lemma('unicycle.n.01.unicycle')]
>>> wn.lemmas('unicycle', pos='n')[0].derivationally_related_forms()
[Lemma('unicyclist.n.01.unicyclist'), Lemma('unicycle.n.01.unicycle'), Lemma('unicycle.v.01.unicycle')]
>>> wn.lemmas('benignancy')[0].derivationally_related_forms()
[Lemma('benign.s.03.benign'), Lemma('benign.s.03.benign')] Considering this, I think we should refrain from fixing further things, even if it's tempting. That's what PWN 3.1 and OEWN are for. We could document the known issues, though. An exception could be made if the data, due to those errors, is no longer loadable in Wn. |
OK, that makes sense.
I will add something to the Readme that these issues exist, and that they
are fixed in later versions.
…On Wed, Nov 3, 2021 at 12:54 PM Michael Wayne Goodman < ***@***.***> wrote:
Maybe check both for duplicate
relations (and discard) and self loops (and discard)?
Returning to the above issue, I just compared the NLTK's wordnet files and
the one's we're using, and they are currently identical (after we fixed the
*inhibit* loop). That is, the NLTK also has the unicycle loop and
redundant edges:
>>> from nltk.corpus import wordnet as wn
>>> wn.lemmas('unicycle', pos='n')
[Lemma('unicycle.n.01.unicycle')]
>>> wn.lemmas('unicycle', pos='n')[0].derivationally_related_forms()
[Lemma('unicyclist.n.01.unicyclist'), Lemma('unicycle.n.01.unicycle'), Lemma('unicycle.v.01.unicycle')]
>>> wn.lemmas('benignancy')[0].derivationally_related_forms()
[Lemma('benign.s.03.benign'), Lemma('benign.s.03.benign')]
Considering this, I think we should refrain from fixing further things,
even if it's tempting. That's what PWN 3.1 and OEWN are for. We could
document the known issues, though. An exception could be made if the data,
due to those errors, is no longer loadable in Wn.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#16 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIPZRUHHJXTTEKCQXSCRMLUKDFALANCNFSM5G3OFFZA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
|
It points to the omw-data project page, which is not the URL for the resource, so it can't be used to automatically download the dependency.
Currently only README and LICENSE without extensions since we don't have README.md, LICENSE.txt, etc. in the data. It is trivial to enable these, though. Also, citation.bib is copied if present.
This way all XML file names are the same as their directories. The predictability makes other things easier.
Build script now requires a version argument. The release workflow should work either when the release is created or on a manual dispatch.
@fcbond I did a bit more work on this; see the commit messages above. I think most things are done now, but the new workflow is untested. I added a "workflow_dispatch" event, so you can click a "Run workflow" button on GitHub to start the action. This might be helpful if there's a failure or we otherwise need to rerun the build. It still starts automatically when a release is created. Anything else to do? |
The omw-en and omw-en31 lexicons are now generated from WordNet sources directly, so the stored xml.xz files are no longer needed. This commit also repurposes, and renames, those files' directories for storing additional files related to the wordnets, such as the LICENSE, README.md, and citation.bib files. This commit also updates and expands the READMEs a bit.
And a bit more work. I pushed the 0.8.3 version of Wn so that dependency is bumped here. I also removed the old pwn files and updated their READMEs, which are now copied into the built omw-en* directories. I realize that we're not building the |
The release script currently includes both omw-en and omw-en31 in omw-1.4.
I think we should probably only include omw-en. This is a difference from
omw-1.3 which included neither, but as the new wordnets require it, I think
it makes sense to bundle it together. I will push a fix.
Should we build a new index.toml? Should it be part of build.sh?
…On Thu, Nov 4, 2021 at 1:50 PM Michael Wayne Goodman < ***@***.***> wrote:
And a bit more work. I pushed the 0.8.3 version of Wn so that dependency
is bumped here. I also removed the old pwn files and updated their READMEs,
which are now copied into the built omw-en* directories.
I realize that we're not building the index.toml file for Wn anymore. But
otherwise I think this is done.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIPZRROGAOW2PGABOO6APDUKIUKTANCNFSM5G3OFFZA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
|
Regarding not bundling omw-en31, that's a good idea. Maybe this is as easy as adding |
Regarding |
I did that. Maybe I forgot to push
…On Fri, 5 Nov 2021, 12:03 am Michael Wayne Goodman, < ***@***.***> wrote:
Regarding not bundling omw-en31, that's a good idea. Maybe this is as easy
as adding --exclude="omw-$VERSION/omw-en31/" to the tar command in
.github/workflows/release.yml?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIPZRW4LANAKJUIGY6D3FTUKK4EZANCNFSM5G3OFFZA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
The workflow just calls the scripts. This makes it easier to test things out locally.
Yeah, I didn't see it. No worries, I pushed a commit which does this. It also moves most shell commands into local scripts so you can test things locally. Just be careful with the |
Oh, and the |
This changes this quite a bit for building the wordnets.
index.toml
file which contains metadata (language, citation, etc.) for each wordnet, as well as pointers to the source (.tab
) files.scripts/tsv2lmf.py
is no longer callable directly.scripts/
directory is now a Python package so relative imports work.pip install -r --requirements.txt
) and get required files.python -m scripts.build --version=XYZ [id...]
build/omw-XYZ/...
(not tracked)release/...
(not tracked)scripts/ili-map.tab
,scripts/WN-LMF.dtd
,wns/omw-citations.tab
) because they are unnecessary or can be retrieved during a build.There are still some things to do with the GitHub Actions workflow:
scripts/wndb2lmf
fits