Strange RT2KB and Typing scores #226
Comments
Thanks for that question. I can only give a general answer since you have uploaded a larger dataset. I think uploading an example with a single document for which the evaluation results differ would give us an easier way of comparing the evaluations 😉 In general, RT2KB does the following: it evaluates the recognition of the entity mentions in the text and the typing of the recognized mentions, and combines both steps.
From the results for these two single steps, you can see that the benchmarked system got an F1-measure of 0.76 for each step. So the combination of both cannot get more than that and will most probably have a lower F1-measure, since correctly identified entities might have got a (partly) wrong type. However, I would be happy to dig into this when you can provide a single example with the results from the other scorers 😃
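For illustration, here is a minimal counting sketch of why the combined (span + type) score cannot exceed the recognition score under exact matching. This is not GERBIL's actual code; the spans, offsets and types below are made up:

```python
# Minimal sketch (not GERBIL's implementation): exact-match micro counts.
# A mention is (start, end, type); all spans and types here are illustrative.

def micro_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def counts(gold, system, with_type):
    # For pure recognition we compare only the (start, end) spans.
    key = (lambda m: m) if with_type else (lambda m: m[:2])
    g, s = {key(m) for m in gold}, {key(m) for m in system}
    tp = len(g & s)
    return tp, len(s) - tp, len(g) - tp  # tp, fp, fn

gold   = [(0, 5, "PER"), (10, 15, "LOC"), (20, 25, "ORG")]
system = [(0, 5, "PER"), (10, 15, "ORG"), (30, 35, "LOC")]  # one wrong type, one wrong span

print(micro_f1(*counts(gold, system, with_type=False)))  # recognition only: 2 tp -> ~0.67
print(micro_f1(*counts(gold, system, with_type=True)))   # recognition + typing: 1 tp -> ~0.33, never higher
```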
Thanks @MichaelRoeder. I will prepare a single document with more specific details on how to reproduce this with GERBIL and the two other scorers ASAP and share it on this thread :)
The results with the conlleval scorer were a happy coincidence, because it does not evaluate by "offset" but by "token", so the way it evaluates the recognition is different. Sorry for that. However, the neleval scorer has a behavior similar to RT2KB and still produces a different result on this single document. Here are the GERBIL results and here is the TAC output (understood by the neleval scorer). Gold standard in TAC:
The equivalent in NIF:
System output in TAC:
Here is the equivalent NIF output:
The scorer is available here and the command line to run the evaluation is:
And here is the output I get:
It seems to me that the neleval output:
corresponds to the "Entity Recognition" score provided by GERBIL at http://gerbil.aksw.org/gerbil/experiment?id=201711270022. However, the strong_typed_mention_match SHOULD correspond to the "Entity Typing". Is this the issue?
No, it should correspond to the first line. Basically, "strong_typed_mention_match" in neleval == "RT2KB" in GERBIL, and "strong_mention_match" in neleval == "Entity Recognition" in GERBIL. The example I gave is a case where the extraction score is equal to the recognition score, because all 7 (out of 11 in total) correctly extracted mentions have their proper type attached. Look at the "TP", "FN" and "FP" values: they are equal:
@jplu thanks for this example. Going through it manually, I have calculated the same result as the neleval scorer:
These numbers lead to precision=0.583, recall=0.636 and F1-score=0.609. So, what I gathered so far is that GERBIL identifies the cases as they are described in the table above. However, the numbers that are calculated based on these counts are not correct. We will search for the problem and update GERBIL.
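For reference, the arithmetic behind these numbers as a small sketch (the counts tp=7, fp=5, fn=4 are inferred from the precision and recall quoted above, not taken directly from GERBIL's output):

```python
# Sketch of the calculation; tp/fp/fn are inferred from the quoted precision and recall.
tp, fp, fn = 7, 5, 4

precision = tp / (tp + fp)                           # 7/12 ≈ 0.583
recall    = tp / (tp + fn)                           # 7/11 ≈ 0.636
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.609

print(round(precision, 3), round(recall, 3), round(f1, 3))
```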
Thanks @MichaelRoeder! Let me know once the bug is fixed.
Hi, sorry it took me so long. Much to do right now. Is there an open endpoint, or could you provide me with the ADEL webservice URL? (here or via DM)
@TortugaAttack I have reproduced the problem using the two NIF files listed above. You can use the FileBasedNIFDataset for loading the data and the InstanceListBasedAnnotator to load the result file of the annotator and simulate the behaviour of an annotator (you have to make sure that the URIs of the documents in both files are the same - I think the annotator result NIF above has a different URI for the document, which needs to be replaced). Based on that, you should add a JUnit test (you can copy and adapt the SingleRunTest for that).
Well. I found a problem. If the annotator provides wrong results:
It will be counted as: I guess it is debatable here whether ETyping should acknowledge Recognition too.
I do not see how this solves the issue, since we have to count it as a false positive - as is done in the table above as well. However, if it solved the problem for you, it might be possible that we count it twice... right?
No, it is not done in the table above. Again:
is not counted in your table.
The table is structured by the gold standard entities (11) and the entities of the system answers (12) mapped to them. The last system answer does not match any gold standard entity (that is the reason for the ---). Apart from that, there are 4 entities from the system that do not exactly match the gold standard (like the "Paris)" example you described). So the table does contain 16 distinct entities 😉
For the recognition of entities, there is no difference, since we can simply sum up the tp, fp and fn counts. However, for the hierarchical F-measure, this is not possible. When evaluating the typing, we have to compare trees/hierarchies of types, which can lead to more than one tp, fp or fn per comparison. Since we want to handle the single entities equally, GERBIL calculates the precision, recall and F1-measure for every entity (these can be found in the table above). The averages of these values are the precision, recall and F1-measure scores for the complete document (for the example above, this is precision=7/16, recall=7/16 and F1-score=7/16).

@jplu @rtroncy I know it is not the most intuitive implementation 😃. It is arguable whether it is okay to count a "missed" entity not only as fn but also with precision and recall = 0, and to count the (nearly matching) fp entity again with precision and recall = 0. The only alternative I can think of is a complicated weighting of the hierarchical tp, fp and fn counts to ensure that entities with a complex type hierarchy do not have a larger influence on the result than entities with an "easy" set of types.
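For illustration, a minimal sketch of this per-entity averaging (not GERBIL's actual code; the 7-out-of-16 split is taken from the example above, and the per-entity scores are simplified to 0 or 1):

```python
# Sketch: per-entity averaging for the typing score (illustrative numbers only).
# Each tuple is (precision, recall) for one entity's type comparison.
per_entity = [(1.0, 1.0)] * 7 + [(0.0, 0.0)] * 9   # 7 correctly typed, 9 missed or spurious

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

n = len(per_entity)
doc_p = sum(p for p, _ in per_entity) / n            # 7/16 = 0.4375
doc_r = sum(r for _, r in per_entity) / n            # 7/16 = 0.4375
doc_f = sum(f1(p, r) for p, r in per_entity) / n     # 7/16 as well

print(doc_p, doc_r, doc_f)  # all three are equal, which explains the identical Typing scores
```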
Thanks @MichaelRoeder and @TortugaAttack. I can perfectly understand your concerns about the scoring issue I raised, but my point is more about being aligned with the known and popular neleval scorer. Personally, I think that the annotation:
must be counted as a "false positive" AND a "false negative" (if the system does not propose nested entities) because the offsets do not match, and then the type, even if it is the correct one, should not be counted as a true positive but as a "false positive" AND a "false negative", just like in the recognition step. This is how neleval works, and I'm OK with that because it seems logical to me. Please, can you let me know once the fix is pushed to the public instance of GERBIL? I will rerun my script for scoring and then compare GERBIL and neleval.
Of course, we will let you know. However, I think we still have a small misunderstanding. Let's focus on the "Paris" / "Paris)" example. I totally agree that the recognition step has to count this as fp AND fn. I think there is no discussion regarding this point 😉
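To make the agreed-upon point concrete, here is a small sketch of exact-offset matching for the recognition step, where a near miss like "Paris)" vs. "Paris" yields both a false positive and a false negative (the offsets are invented for illustration):

```python
# Sketch: exact-offset matching; the "Paris"/"Paris)" offsets below are illustrative.
gold   = {(120, 125)}   # "Paris"
system = {(120, 126)}   # "Paris)" -- one character too long

tp = len(gold & system)   # 0: the spans are not identical
fp = len(system) - tp     # 1: "Paris)" is a spurious mention
fn = len(gold) - tp       # 1: "Paris" was not found with the exact boundaries

print(tp, fp, fn)  # 0 1 1 -> counted as a false positive AND a false negative
```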
Hello,
The RT2KB and Typing processes give strange scores compared to other scorers. Every time I run an RT2KB process on a NIF dataset, I always get exactly the same score for Precision, Recall and F1, which is quite odd (see this example). If I evaluate the same output with two other scorers (neleval and conlleval), I get the same results with both scorers, and they are much higher than what RT2KB gives me (P = 0.717, R = 0.765, F1 = 0.740).
The description of RT2KB says "the annotator gets a text and shall recognize the entities inside and their types"; consequently, I'm curious to know how the three measures can be equal for Typing when they are different for Recognition.
Any light on this would be welcome :)
Thanks!