Create module and accept other formats #12

Closed
ivyleavedtoadflax opened this issue Jul 11, 2019 · 8 comments

Comments

@ivyleavedtoadflax
Contributor

Hi @davidsbatista, good to meet you the other day!

So I've done a bit more work on this, but have not opened a PR here yet. I created a module structure and added CI/CD here: MantisAI/nervaluate#1. Not sure if you want me to PR into this repo. I'm happy to, but it will start to move away from the codebase referred to in the blog post; I guess that is fine with good docs?

Next thing I intend to work on is accepting different formats. So far the tool accepts two lists of tags, and then converts them to namedtuples.

The format that Prodigy uses is JSON-based (below), and very similar to the named tuple you used originally:

from collections import namedtuple

prodigy_format = {
    "text": "Apple",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

# current named tuple:

Entity = namedtuple("Entity", "e_type start_offset end_offset")
Entity("ORG", 0, 5)

I have been considering switching away from the named tuple and just using the Prodigy JSON format in the package. This would mean that we can rely on other converters (from CoNLL to Prodigy JSON, for example) if they exist, or publish them if they do not, and it would all tie in with the spaCy/Prodigy ecosystem. What do you think? Are you strongly attached to the namedtuples, or is there something that I am overlooking?
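To illustrate how close the two representations are, here is a minimal sketch of converting between them. The helper names `spans_to_entities` and `entities_to_spans` are hypothetical and are not taken from either repo. The sketch also assumes the named tuple carries the same offsets as the span dict; if the named tuple is meant to hold token indices while the span dict holds character offsets, a real converter would need an extra mapping step.

```python
from collections import namedtuple

Entity = namedtuple("Entity", "e_type start_offset end_offset")


def spans_to_entities(doc):
    """Turn a Prodigy-style dict into a list of Entity tuples (hypothetical helper)."""
    return [Entity(s["label"], s["start"], s["end"]) for s in doc.get("spans", [])]


def entities_to_spans(text, entities):
    """Turn Entity tuples back into a Prodigy-style dict (hypothetical helper)."""
    spans = [
        {"start": e.start_offset, "end": e.end_offset, "label": e.e_type}
        for e in entities
    ]
    return {"text": text, "spans": spans}


prodigy_format = {
    "text": "Apple",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

entities = spans_to_entities(prodigy_format)
round_trip = entities_to_spans(prodigy_format["text"], entities)
```

Under these assumptions the conversion is lossless in both directions, so the choice between the two is mostly about which ecosystem the package should line up with.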

@davidsbatista
Owner

Looks good Matt!

I'm all for building a useful pip package out of the blog post code and keeping the blog post and its code as they are. Moving to a new repo is totally fine by me; you are also giving much more input than I am, so it's only fair that it goes into your own repo.

I used named tuples because, at the time, I just wanted a light structure to hold an entity's essential information.

When you mention JSON, do you mean the input for the Evaluator or the structure used inside the Evaluator? If it's the internal structure, you would always need to convert the JSON to a dict or something else, no?

I would keep the named tuple, look at which formats are most commonly used or output by NLP libraries and sequence taggers, and add code to the Evaluator so that it can consume these formats and transform them into its internal structure.
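As a rough sketch of that idea, the function below converts one common tagger output, a list of BIO tags, into the named-tuple structure. The name `tags_to_entities` is hypothetical, and the convention that offsets are inclusive token indices is an assumption made for this example rather than something stated in the thread.

```python
from collections import namedtuple

Entity = namedtuple("Entity", "e_type start_offset end_offset")


def tags_to_entities(tags):
    """Convert BIO tags to Entity tuples with inclusive token indices (assumed)."""
    entities = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        label = label or None  # "O" and malformed tags carry no label
        # An I- tag with the same label extends the open entity.
        continues = prefix == "I" and label == current_type
        if current_type is not None and not continues:
            entities.append(Entity(current_type, start, i - 1))
            current_type, start = None, None
        # A labelled tag with no open entity starts a new one.
        if label is not None and current_type is None:
            current_type, start = label, i
    if current_type is not None:
        entities.append(Entity(current_type, start, len(tags) - 1))
    return entities


example = tags_to_entities(["O", "B-PER", "I-PER", "O", "B-LOC"])
```

Other input formats could each get a small converter of this kind that targets the same internal structure, leaving the scoring code untouched.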

Some other suggestions:

Is there an easy way to add these two papers:

in the description or somewhere in the setup.cfg file? Or maybe somewhere in the documentation? I just implemented the ideas suggested in these papers, so they deserve to be credited.

Just a bit of cleaning, and changing my email address:
author="David Batista and Matthew Upson"
author_email="david.batista@gmail.com matthew.a.upson@gmail.com"

@ivyleavedtoadflax
Contributor Author

Sure, I will add these things in.

Yes, you're right, I mean dicts (not JSON). I've implemented what I was talking about here: MantisAI/nervaluate#2

@davidsbatista
Owner

If you prefer dicts over named tuples and it's working, go ahead, I don't mind :)

Shall we start gathering a list of possible output formats somewhere, along with how to parse them into this Evaluator? The first one that comes to mind is CoNLL.
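A minimal sketch of what a CoNLL reader could look like is below. It assumes a CoNLL-2003-style layout: one token per line, whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and optional `-DOCSTART-` marker lines. The function name `read_conll` and the sample text are illustrative only and do not come from either repo.

```python
def read_conll(text):
    """Split CoNLL-style text into (tokens, tags) pairs, one per sentence."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])  # first column: the token
        tags.append(columns[-1])   # last column: the NER tag (assumed)
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Made-up sample in the assumed layout.
sample = (
    "-DOCSTART- -X- O O\n"
    "\n"
    "EU NNP B-NP B-ORG\n"
    "rejects VBZ B-VP O\n"
    "\n"
    "Peter NNP B-NP B-PER\n"
    "Blackburn NNP I-NP I-PER\n"
)
sentences = read_conll(sample)
```

The per-sentence tag lists produced here match the two-lists-of-tags input the tool already accepts, so a reader like this would only need to sit in front of the existing code.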

@ivyleavedtoadflax
Contributor Author

Good idea. Perhaps I should break the fork, so we can add issues to the repo itself instead of here?

@ivyleavedtoadflax
Contributor Author

I will also grant you access!

@davidsbatista
Owner

Yes, break the fork, and let's build a proper pip package on your repo.

@ivyleavedtoadflax
Contributor Author

ivyleavedtoadflax commented Jul 17, 2019

@davidsbatista great, link broken. I'll start an issue for a list of formats 👍

MantisAI/nervaluate#3

@ivyleavedtoadflax
Contributor Author

Moved to MantisAI/nervaluate#3.
