
Suggestions for annotating a new corpus #993

Open
AngledLuffa opened this issue Nov 18, 2023 · 7 comments

@AngledLuffa

AngledLuffa commented Nov 18, 2023

I am wondering, what advice is there for starting a new corpus? Is there a guide for doing so?

There is a team in Pakistan at Isra University who would like to see more Sindhi NLP tools in Stanza, and one of the ways we could make that happen is by annotating more raw data. (Currently there is not very much Sindhi in UD.) We've been able to find an annotation company with some linguistics knowledge of Sindhi, and of course there are people at Isra who would put together a schema, review annotations, and possibly do some annotation as well.

There's already some tokenized data, so that should be taken care of. I believe the next step would be to label it with POS and dependencies.

Would it make sense to:

  • come up with an initial schema for dependencies, possibly with some sentences analyzed
  • pass this guide to the annotators with a portion of the data
  • see what comes back, correct errors when possible
  • use this to produce silver dependencies which the annotation team can correct, hopefully making the task easier
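One way to make the first hand-off concrete would be to give the annotators CoNLL-U skeletons built from the existing tokenized data, with the annotation columns left blank. A minimal sketch (the helper is hypothetical, not from any UD tool; the column order follows the CoNLL-U spec):

```python
# Sketch: turn pre-tokenized sentences into CoNLL-U skeletons whose
# UPOS/HEAD/DEPREL columns are left as "_" for annotators to fill in.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.

def to_conllu_skeleton(sentences):
    """sentences: list of token lists -> one CoNLL-U string."""
    blocks = []
    for sent_id, tokens in enumerate(sentences, start=1):
        lines = [f"# sent_id = {sent_id}", "# text = " + " ".join(tokens)]
        for i, form in enumerate(tokens, start=1):
            lines.append(f"{i}\t{form}\t_\t_\t_\t_\t_\t_\t_\t_")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

Annotators would then only fill in fields, and the round trip back into the repo stays a plain-text diff.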

Are there better approaches for getting annotators who may not be very familiar with dependencies to label things? For example, I could also imagine breaking sentences into phrases and then trying to describe the relations between phrases as an easier approach for getting high quality annotations. That almost sounds like constituencies, for that matter, so perhaps it would be easier to build a constituency dataset and convert that to dependencies in some way.

@muteeurahman

@dan-zeman
Member

There is already a Sindhi dataset on the UD GitHub by @mazharaliabro. It has never been released, primarily because it does not have dependencies. But it is 675 sentences / 6863 tokens with UPOS tags and some features. I suppose someone could use it to train a tagger and apply it to the new data. It should be checked whether the tokenization is compatible.
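The tokenization check can be partly automated before committing to the old tags: extract the FORM columns from the existing CoNLL-U and test whether the two corpora split the same character sequences differently. A rough sketch (function names are my own):

```python
# Sketch: compare two tokenizations of overlapping raw text.

def conllu_tokens(conllu_text):
    """Extract FORM columns per sentence from a CoNLL-U string,
    skipping comment lines, multiword-token ranges, and empty nodes."""
    sents, cur = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            if cur:
                sents.append(cur)
                cur = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if "-" not in cols[0] and "." not in cols[0]:
                cur.append(cols[1])
    if cur:
        sents.append(cur)
    return sents

def compatible(tokens_a, tokens_b):
    """Same underlying characters, regardless of split points?
    If this fails, the two corpora tokenized different text; if it
    holds but the lists differ, only the segmentation disagrees."""
    return "".join(tokens_a) == "".join(tokens_b)
```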

Regarding dependencies, I imagine that a parser based on XLM Roberta (it seems to contain Sindhi) and a mixture of existing UD treebanks (in the spirit of Udify) could produce something that the annotators could use.

With inexperienced annotators it may be even more advisable to implement a language-specific validator that checks patterns the universal validator cannot check.
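For illustration, such a language-specific check can be a few lines over the parsed rows. The rule below (ADPs should attach as case/mark/fixed) is only a stand-in for whatever the Sindhi guidelines end up requiring, and the row format is a simplified tuple rather than full CoNL-U parsing:

```python
# Sketch of one language-specific check in the spirit of the UD
# validator. The rule itself is illustrative, not an actual Sindhi rule.

def check_adp_deprel(sentence):
    """sentence: list of (id, form, upos, head, deprel) tuples.
    Returns warnings for ADP tokens with an unexpected relation."""
    warnings = []
    for tid, form, upos, head, deprel in sentence:
        if upos == "ADP" and deprel not in {"case", "mark", "fixed"}:
            warnings.append(f"token {tid} ({form}): ADP with deprel {deprel}")
    return warnings
```

Each such rule is cheap to add, and running the whole battery after every annotation batch catches systematic misunderstandings early.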

@dan-zeman
Member

> starting a new corpus? Is there a guide for doing so?

Yes, there is this. But every language is special and there are huge differences in what resources already exist and can be potentially used.

@meesumalam

@AngledLuffa I am working on UD for the Saraiki language, which is closely related to Sindhi.

I am a PhD student in computational linguistics at Indiana University, and I would be happy to share my thoughts on this project. Thanks!

@AngledLuffa
Author

@dan-zeman Thank you for the link and the suggested starting point. I would worry about how much Sindhi data is really in XLM-R: looking over other multilingual transformers that include Sindhi, they generally have very little raw text for it. The idea of knowledge transfer from an existing language is an interesting one.
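One quick way to probe that worry is subword fertility: tokenize held-out Sindhi text with the model's vocabulary and count subwords per whitespace token; unusually high fertility is a common symptom of thin pretraining coverage. A sketch, written generically over the tokenizer so nothing here depends on a particular library (in practice one would pass in e.g. an XLM-R tokenizer's `.tokenize` method):

```python
# Sketch: average subwords per whitespace token ("fertility").
# `tokenize` is any callable mapping a word to a list of subword pieces.

def subword_fertility(sentences, tokenize):
    words = [w for s in sentences for w in s.split()]
    pieces = sum(len(tokenize(w)) for w in words)
    return pieces / max(len(words), 1)
```

Comparing the number for a few thousand Sindhi sentences against a well-covered language such as Hindi would give a rough signal of how much the pretrained model actually knows about Sindhi.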

We had noticed the unfinished Sindhi dataset. I'm not sure what the current expectation is in terms of how finished we think the upos tagging & featurization is. Depending on how much we want to use it, there may already be enough to start a tagger. Not having dependencies will be a bit of a limitation at first, I would expect.

@meesumalam Thank you for the suggestion. Would it make sense to connect you directly with @muteeurahman? I am curious what you've found in terms of raw text for annotating or building language models, especially if you've come across such data in Sindhi. There is a limited amount of Sindhi data in Common Crawl or Wikipedia, and I would expect even less for Saraiki (I don't see it listed in the OSCAR version of CC, for example).

@meesumalam

Right, Saraiki doesn't have much data compared to Sindhi.

You can reach me at meealam@iu.edu for further discussion on the topic.

Thanks!

@muteeurahman

muteeurahman commented Nov 23, 2023 via email

@meesumalam

meesumalam commented Nov 23, 2023 via email
