Add Coptic stoplist? #634

Closed
diyclassics opened this issue Jan 29, 2018 · 11 comments · Fixed by #1057

Comments

@diyclassics
Collaborator

Coptic does not yet have a language-specific submodule in cltk.stop: https://github.com/cltk/cltk/tree/master/cltk/stop

@osho-agyeya
Contributor

I shall take this.

@AMR-KELEG
Contributor

AMR-KELEG commented Feb 1, 2021

Hi,
I found this repo https://github.com/cinkova/stopwoRds
It supports Coptic as well as many other languages.
Is this a good seed stopwords list to start with?

@kylepjohnson
Member

Interesting find @AMR-KELEG . Are you interested in doing this one yourself? It should be pretty easy, plus you could learn about our upcoming release.

Steps:

  • Make a new file at src/cltk/stops/cop.py on the dev branch (here)
  • Make its formatting identical to src/cltk/stops/lat.py (here)
  • In src/cltk/stops/words.py add cop to the imports (here)
  • Add to dict here
  • Add StopsProcess to the CopticPipeline (here)
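
The steps above can be sketched roughly as follows. This is a self-contained illustration, not CLTK's actual code: the names `COP_STOPS`, `MAP_ISO_TO_STOPS`, and `remove_stopwords` are assumptions made for the example, and the few Coptic entries are placeholders that have not been validated by a Coptic scholar.

```python
"""Sketch of a per-language stopword module and its registration."""

# What a file such as cop.py might contain: one flat list of stopwords.
# Placeholder entries only; a real list needs review by Coptic scholars.
COP_STOPS = [
    "ⲁⲩⲱ",  # "and"
    "ϫⲉ",  # "that"
    "ϩⲛ",  # "in"
    "ⲙⲛ",  # "with"
]

# Hypothetical registry mapping ISO language codes to stopword lists,
# standing in for the import-plus-dict steps described above.
MAP_ISO_TO_STOPS = {"cop": COP_STOPS}


def remove_stopwords(tokens, iso_code):
    """Return the tokens that are not stopwords for the given language."""
    stops = set(MAP_ISO_TO_STOPS[iso_code])
    return [token for token in tokens if token not in stops]
```

Calling `remove_stopwords(["ⲁⲩⲱ", "ⲣⲱⲙⲉ"], "cop")` would drop the conjunction and keep the noun. Exact string matching like this only works once the text has been split into the same units the list uses.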

@AMR-KELEG
Contributor

Yes, I would love to do it myself.
Thanks for the detailed pointers.
I am just concerned about the precision of this list. Is there a way to double-check whether each entry is a stopword without knowing Coptic?

@kylepjohnson
Member

I am just concerned about the precision of this list. Is there a way to double check if each entry is a stopword or not without having knowledge of coptic?

🤣

You're asking a fair question. I have found that it is best to code something and then ask for help.

Presumably the people who made this first list are not totally ignorant. To make the list better in the future, we can:

  • Ask Coptic scholars to look at them and validate for us
  • Learn a little about the language ourselves. It does have some inflections so we need to make sure that the variants are in there: https://en.wikipedia.org/wiki/Coptic_language#Grammar
  • There is a little bit of theory behind stopword lists -- some take a statistical approach and remove whatever is very frequent; others take a grammatical approach and remove words w/ low semantics like pronouns (he, you, they) and articles (the, a). For CLTK we have taken the latter approach.
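
To make the two approaches in the last point concrete, here is a minimal sketch on a toy English sentence. The function names and the tag set are illustrative assumptions, not CLTK code.

```python
from collections import Counter


def frequency_stoplist(tokens, top_n):
    """Statistical approach: take the top_n most frequent tokens."""
    return [word for word, _count in Counter(tokens).most_common(top_n)]


def grammatical_stoplist(tagged_tokens, closed_classes):
    """Grammatical approach: keep distinct words with a closed-class tag."""
    stops = []
    for word, pos in tagged_tokens:
        if pos in closed_classes and word not in stops:
            stops.append(word)
    return stops


# Toy data: (word, part-of-speech) pairs for one sentence.
TAGGED = [
    ("the", "DET"), ("cat", "NOUN"), ("saw", "VERB"), ("the", "DET"),
    ("dog", "NOUN"), ("and", "CCONJ"), ("the", "DET"), ("bird", "NOUN"),
]
WORDS = [word for word, _pos in TAGGED]

by_frequency = frequency_stoplist(WORDS, top_n=1)
by_grammar = grammatical_stoplist(TAGGED, {"DET", "CCONJ", "PRON"})
```

On this sentence the frequency approach with `top_n=1` picks only "the", while the grammatical approach also catches "and" even though it occurs once. The grammatical approach needs part-of-speech information, which is why it depends on tagged data or expert review.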

If you're still interested, @AMR-KELEG, do you have a date by which you think you could finish this? Take as long as you want, but a soft deadline helps me remember to follow up.

If you have any issues w/ our new codebase (use the dev branch), please reach out or email me!

@AMR-KELEG
Contributor

AMR-KELEG commented Feb 11, 2021

I have been having trouble with deadlines recently, but I will make sure to work on this PR and provide frequent updates.
I tried installing the R library, but I am getting errors when loading a file (I am not a fan of R, but I will try to hack the scripts).
On the other hand, I found that the library makes use of the Universal Dependencies treebank (https://github.com/UniversalDependencies/UD_Coptic-Scriptorium) based on this paper (https://www.aclweb.org/anthology/W18-6022.pdf).
I am thinking of parsing the treebank files to extract tokens with closed-class POS tags such as conjunctions (e.g., the conjunction token in the Coptic treebank).
This would only work for separable tokens, but for clitics such as the determiner article (see the determiner token in the Coptic treebank), it won't be easy to classify these sub-tokens as stopwords without having a word segmentation model for Coptic in cltk.
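
A rough sketch of the treebank idea, assuming the files follow the standard ten-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, ...). The function name, the tag selection, and the tiny sample below are made up for illustration and are not taken from the treebank itself.

```python
# Universal POS tags commonly treated as closed classes.
CLOSED_CLASS_UPOS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}


def closed_class_forms(conllu_text, upos_tags=CLOSED_CLASS_UPOS):
    """Collect the surface forms of tokens whose UPOS tag is closed-class."""
    forms = set()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and "# ..." comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, form, upos = cols[0], cols[1], cols[3]
        # Skip multiword-token ranges ("2-3") and empty nodes ("1.1");
        # the individual syntactic words follow on their own lines.
        if "-" in token_id or "." in token_id:
            continue
        if upos in upos_tags:
            forms.add(form)
    return sorted(forms)


def _row(*cols):
    """Join columns with tabs so the sample is valid CoNLL-U."""
    return "\t".join(cols)


# Invented two-word sample: a conjunction, then a bound group that is
# split into a determiner clitic and a noun.
SAMPLE = "\n".join([
    "# text = ⲁⲩⲱ ⲡⲣⲱⲙⲉ",
    _row("1", "ⲁⲩⲱ", "ⲁⲩⲱ", "CCONJ", "_", "_", "3", "cc", "_", "_"),
    _row("2-3", "ⲡⲣⲱⲙⲉ", "_", "_", "_", "_", "_", "_", "_", "_"),
    _row("2", "ⲡ", "ⲡ", "DET", "_", "_", "3", "det", "_", "_"),
    _row("3", "ⲣⲱⲙⲉ", "ⲣⲱⲙⲉ", "NOUN", "_", "_", "0", "root", "_", "_"),
])

candidate_stops = closed_class_forms(SAMPLE)
```

In this sample the determiner shows up as a candidate only because the annotation already splits the bound group. On raw text the same clitic stays attached to its noun, which is the segmentation problem described above.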

@kylepjohnson
Member

On the other hand, I found that the library makes use of the universal dependencies treebank

This would only work for separable tokens but for clitics such as the determiner article

it won't be easy to classify these sub-tokens as stopwords without having a word segmentation mode

Your plan here might work, but I think it is preferable to find a stopwords list from another source and start from there.

For splitting words, this is a separate process that could/should be taken care of by the CopticStanzaProcess.

@AMR-KELEG How about you raise an issue on their repo? https://github.com/cinkova/stopwoRds You can reference this issue here and ask for a plaintext version of their Coptic stopwords.

@AMR-KELEG
Contributor

@AMR-KELEG How about you raise an issue on their repo? https://github.com/cinkova/stopwoRds You can reference this issue here and ask for a plaintext version of their Coptic stopwords.

Let's hope we will get a response soon 😅
computationalstylistics/tidystopwords#7

@kylepjohnson
Member

@AMR-KELEG Good work!

Did the developer email you the list?

@AMR-KELEG
Contributor

@AMR-KELEG Good work!

Did the developer email you the list?

Yes, she did 🎉
coptic_closedPOS.zip

@kylepjohnson
Member

@AMR-KELEG

I added a few Coptic stopwords and your code is good. There's probably an issue w/ Stanza's Coptic module … will check it out.
