Different Classification Reports for IOBES and BILOU #71

Closed
rsuwaileh opened this issue Nov 14, 2020 · 3 comments
Labels: question (Further information is requested)

Comments

rsuwaileh commented Nov 14, 2020

I compared the results of the same model on the same test data with both the IOBES and BILOU schemes. I get exactly the same precision, recall, and F1 scores, which is what I expect:

Precision = 0.6762295081967213
Recall = 0.5045871559633027
F1 = 0.5779334500875658

However, I get different classification reports as shown below! Any explanation for this?
BILOU:

              precision    recall  f1-score   support

         LOC      0.676     0.505     0.578       327

   micro avg      0.676     0.505     0.578       327
   macro avg      0.676     0.505     0.578       327
weighted avg      0.676     0.505     0.578       327

IOBES:

              precision    recall  f1-score   support

         LOC      0.667     0.503     0.574       314

   micro avg      0.667     0.503     0.574       314
   macro avg      0.667     0.503     0.574       314
weighted avg      0.667     0.503     0.574       314
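
For context, the two schemes mark exactly the same spans and differ only in the prefix letters for the last and single-token positions: IOBES uses E/S where BILOU uses L/U. A minimal conversion sketch (the helper name iobes_to_bilou is hypothetical, not part of seqeval):

def iobes_to_bilou(tags):
    # Map IOBES prefixes to their BILOU equivalents: E -> L, S -> U.
    # B, I, and O are shared by both schemes.
    mapping = {'E': 'L', 'S': 'U'}
    converted = []
    for tag in tags:
        if tag == 'O':
            converted.append(tag)
            continue
        prefix, _, entity = tag.partition('-')
        converted.append(f"{mapping.get(prefix, prefix)}-{entity}")
    return converted

print(iobes_to_bilou(['O', 'B-LOC', 'E-LOC', 'O', 'S-LOC']))
# ['O', 'B-LOC', 'L-LOC', 'O', 'U-LOC']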

My Environment

  • Operating System: Windows 10
  • Python Version: 3.8.3
  • Package Version: 1.2.2
Hironsan (Member) commented Nov 14, 2020

Please show me the evaluation snippet and the data.

rsuwaileh (Author) commented

I generated a small example from my dataset:

z_true = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'O', 'O', 'O', 'S-LOC', 'S-LOC', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'B-LOC', 'E-LOC', 'O', 'O', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]

z_pred = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'E-LOC', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'O', 'O', 'O', 'S-LOC', 'S-LOC', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'O'], 
['O', 'S-LOC', 'O', 'O', 'O', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'B-LOC', 'E-LOC', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
from seqeval.scheme import IOBES  # makes the scheme object available to seqeval

scheme = IOBES
average = "micro"
evaluate(z_true, z_pred, scheme, average)

The results I get:

0.6666666666666666	0.8	0.7272727272727272
              precision    recall  f1-score   support

         LOC      0.667     0.800     0.727        10

   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10

When I change the scheme to BILOU using the same example and labels as above:

z_true = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'O', 'O', 'O', 'U-LOC', 'U-LOC', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'B-LOC', 'L-LOC', 'O', 'O', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]

z_pred = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'L-LOC', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'O', 'O', 'O', 'U-LOC', 'U-LOC', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'O'], 
['O', 'U-LOC', 'O', 'O', 'O', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'B-LOC', 'L-LOC', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
from seqeval.scheme import BILOU  # makes the scheme object available to seqeval

scheme = BILOU
average = "micro"
evaluate(z_true, z_pred, scheme, average)

I get the same P, R, & F1. However, the report is different. I'm using micro average with both schemes:

0.6666666666666666	0.8	0.7272727272727272
              precision    recall  f1-score   support

         LOC      0.625     0.556     0.588         9

   micro avg      0.625     0.556     0.588         9
   macro avg      0.625     0.556     0.588         9
weighted avg      0.625     0.556     0.588         9

This is the evaluate function that uses seqeval:

from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

def evaluate(y_true, y_pred, scheme, average):
    print(precision_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(recall_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(f1_score(y_true, y_pred, average=average, mode='strict', scheme=scheme))
    print(classification_report(y_true, y_pred, digits=3))

Hironsan (Member) commented

You just forgot to specify mode and scheme in classification_report. If they are specified correctly, the results are the same:

from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

def evaluate(y_true, y_pred, scheme, average):
    print(precision_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(recall_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(f1_score(y_true, y_pred, average=average, mode='strict', scheme=scheme))
    print(classification_report(y_true, y_pred, digits=3, mode='strict', scheme=scheme))

# IOBES
0.6666666666666666      0.8     0.7272727272727272
              precision    recall  f1-score   support

         LOC      0.667     0.800     0.727        10

   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10

# BILOU
0.6666666666666666      0.8     0.7272727272727272
              precision    recall  f1-score   support

         LOC      0.667     0.800     0.727        10

   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10
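
For completeness, a self-contained sketch of the fix (assuming seqeval 1.2.x; the sentences below are shortened placeholders, not the data from this issue):

from seqeval.metrics import classification_report
from seqeval.scheme import IOBES, BILOU

# Shortened placeholder sentences, not the data from the issue.
y_true_iobes = [['O', 'B-LOC', 'E-LOC', 'O', 'S-LOC']]
y_pred_iobes = [['O', 'B-LOC', 'E-LOC', 'S-LOC', 'O']]
y_true_bilou = [['O', 'B-LOC', 'L-LOC', 'O', 'U-LOC']]
y_pred_bilou = [['O', 'B-LOC', 'L-LOC', 'U-LOC', 'O']]

# With mode='strict' and the matching scheme, classification_report agrees with
# precision_score / recall_score / f1_score called with the same arguments,
# and the two schemes produce identical reports.
print(classification_report(y_true_iobes, y_pred_iobes, digits=3, mode='strict', scheme=IOBES))
print(classification_report(y_true_bilou, y_pred_bilou, digits=3, mode='strict', scheme=BILOU))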

Hironsan added the question label on Nov 21, 2020