## Analysis of the seqeval `classification_report` output

In [1]:
from seqeval.metrics import classification_report
from seqeval.metrics import precision_score, recall_score, f1_score
from seqeval.scheme import IOB2
from seqeval.metrics import sequence_labeling
import numpy as np

The following `y_pred` and `y_true` were obtained by analysing where `classification_report` changes its output between default mode (`mode=None`) and `mode='strict'` when the predictions and true labels of the `age` entity are added iteratively.

**REMARK:** In one file a `B-age` was labeled as `I-eligibility`; this sample has been omitted to simplify the analysis.

What does each mode option do? The `classification_report` docstring describes the parameter as follows:
mode: Whether to count correct entity labels with incorrect I/B tags as true positives or not.
        If you want to only count exact matches, pass mode="strict". default: None.

- `Default/Lenient`: a prediction is considered correct if the entity was correctly identified, even if the B or I prefix is wrong.
- `Strict`: a prediction is considered correct only if both the entity and the prefix are correctly identified.
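
To make the default behaviour concrete, the following is a minimal pure-Python sketch of prefix-tolerant chunking. It is a simplified illustration, not seqeval's own code; the function name `lenient_chunks` is hypothetical and the sketch assumes tags of the form `B-type`, `I-type` or `O`.

```python
def lenient_chunks(tags):
    """Group IOB-style tags into (type, start, end) chunks, tolerating a missing B- prefix."""
    chunks = []
    prev_type, start = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the extra "O" closes the last chunk
        prefix, _, type_ = tag.partition("-")
        if prefix == "O":
            type_ = None
        # A chunk ends when the tag is O, a new B- begins, or the entity type changes
        if prev_type is not None and (prefix in ("O", "B") or type_ != prev_type):
            chunks.append((prev_type, start, i - 1))
        # A chunk starts at B-, or at an I- that does not continue the previous chunk
        if prefix == "B" or (prefix == "I" and type_ != prev_type):
            start = i
        prev_type = type_
    return chunks


print(lenient_chunks(["B-age", "I-age"]))  # [('age', 0, 1)]
print(lenient_chunks(["I-age", "I-age"]))  # [('age', 0, 1)]
```

Under this reading, a chunk that opens with `I-` yields the same `(type, start, end)` tuple as one that opens with `B-`, which is why a wrong prefix alone does not cost a true positive in default mode.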

In [2]:
#Lists where both modes of classification_report give the same results
y_pred_1 = [['O', 'O', 'O'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age']]
y_true_1 = [['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age']]

#Show the results
print(classification_report(y_true_1, y_pred_1))
print(classification_report(y_true_1, y_pred_1, mode='strict', scheme=IOB2))

report_lenient_1 = classification_report(y_true_1, y_pred_1, output_dict=True)
report_strict_1 = classification_report(y_true_1, y_pred_1, mode='strict', scheme=IOB2, output_dict=True)


              precision    recall  f1-score   support

         age       1.00      0.77      0.87        13

   micro avg       1.00      0.77      0.87        13
   macro avg       1.00      0.77      0.87        13
weighted avg       1.00      0.77      0.87        13

              precision    recall  f1-score   support

         age       1.00      0.77      0.87        13

   micro avg       1.00      0.77      0.87        13
   macro avg       1.00      0.77      0.87        13
weighted avg       1.00      0.77      0.87        13



Now, during the evaluation process, one file containing the `age` tag had the tokens `[['age', '(', 'y', ')', '51', '.', '9', '±', '8', '.', '8']]` predicted as `[['O', 'O', 'I-age', 'O', 'B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age']]`.
The true label sequence is `[['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age']]`.

If we add this new prediction to the predicted and true lists, we obtain the following:

In [3]:
added_pred = [['O', 'O', 'I-age', 'O', 'B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age']]
added_true = [['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age']]

y_pred_2 = y_pred_1 + added_pred
y_true_2 = y_true_1 + added_true

#Show the results
print(classification_report(y_true_2, y_pred_2))
print(classification_report(y_true_2, y_pred_2, mode='strict', scheme=IOB2))

              precision    recall  f1-score   support

         age       0.83      0.71      0.77        14

   micro avg       0.83      0.71      0.77        14
   macro avg       0.83      0.71      0.77        14
weighted avg       0.83      0.71      0.77        14

              precision    recall  f1-score   support

         age       0.91      0.71      0.80        14

   micro avg       0.91      0.71      0.80        14
   macro avg       0.91      0.71      0.80        14
weighted avg       0.91      0.71      0.80        14



We see that this new prediction changes the output of the reports. But why does strict mode report higher values if it is the more restrictive approach? To analyse this, we compute how each report internally counts the True Positives, False Negatives and False Positives (True Negatives are not needed for precision, recall or F1).

In [4]:
def test_results(y_true, y_pred, TP, FN, FP, mode=None):
    #Check that precision, recall and F1-score computed from the counts match the seqeval report; an AssertionError is raised otherwise
    if TP == 0:
        precision = 0
        recall = 0
        f1 = 0
    else:
        precision = TP / (TP + FP)
        recall = TP / (TP + FN)
        f1 = 2 * ((precision * recall) / (precision + recall))

    report = classification_report(y_true, y_pred, mode=mode, scheme=IOB2, output_dict=True)

    #Assert an error if the results are different otherwise print that the results are the same
    assert report['age']['precision'] == precision
    assert report['age']['recall'] == recall
    assert report['age']['f1-score'] == f1
    print('Results are the same with cm and seqeval')
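
The exact `==` comparisons above hold for the cases in this notebook, but floating-point equality can be fragile when the same ratio is reached through a different order of operations. A hedged alternative is sketched below; `metrics_from_counts` and `same_metric` are hypothetical helper names, and the counts used are the ones reported later for the second test in default mode.

```python
import math


def metrics_from_counts(tp, fn, fp):
    """Return (precision, recall, f1) from entity-level counts, using 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def same_metric(a, b, rel_tol=1e-9):
    """Compare two metric values with a small tolerance instead of exact equality."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12)


precision, recall, f1 = metrics_from_counts(tp=10, fn=4, fp=2)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.83 0.71 0.77
```

Replacing `report['age']['precision'] == precision` with a call such as `same_metric(report['age']['precision'], precision)` would keep the check meaningful while tolerating rounding noise.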


In [5]:
def compute_cm(y_true, y_pred, suffix = False):
    '''
    Compute the confusion matrix for a sequence labeling task
    Args:
        y_true: list of lists of strings, the true labels
        y_pred: list of lists of strings, the predicted labels
        suffix: boolean, passed to get_entities to select how the tags are parsed
                False: B-entity and I-entity belong to the same entity (default mode)
                True: the B and I parts are extracted as separate chunks, used here to emulate strict mode
    Returns:
        TP, FN, FP: integers, the number of True Positives, False Negatives, False Positives
    '''
    if suffix:
        mode = 'strict'
        #Debug aid only: show which non-age entities the default parser finds in the predictions
        aux_pred_entities = [sequence_labeling.get_entities(example, suffix=False) for example in y_pred]
        for entity in aux_pred_entities:
            #Iterate over a copy so that removing items does not skip elements
            for subentity in list(entity):
                if subentity[0] != 'age':
                    #It will be considered as FN
                    print(entity)
                    entity.remove(subentity)
                    print(entity)
    else:
        mode = None

    print(f"Mode: {mode}")

    #Get the entities
    true_entities = [sequence_labeling.get_entities(example, suffix=suffix) for example in y_true]
    pred_entities = [sequence_labeling.get_entities(example, suffix=suffix) for example in y_pred]

    #Count the TP, FN, FP and TN
    TP = 0
    FN = 0
    FP = 0
    TN = 0

    for true, pred in zip(true_entities, pred_entities):
        print(f"True: {true} - Predicted: {pred}")
        for entity in true:
            if entity in pred:
                TP += 1
            else:
                FN += 1
        for entity in pred:
            if entity not in true:
                if mode == 'strict':      
                    if entity[0] == 'B':  
                        FP += 1*2  #Everything is computed twice in strict mode due to the split in B- and I-
                else:
                    if entity[0] == 'age': #If the entity is not in the true labels and the entity is age, otherwise it is false negative
                        FP += 1
                    
    print(f"True positives: {TP} - False negatives: {FN} - False positives: {FP}")

    #Test if the results are the same as the seqeval library
    test_results(y_true, y_pred, TP, FN, FP, mode=mode)

    return TP, FN, FP, true_entities, pred_entities
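
The counting loop in default mode amounts to set operations on `(type, start, end)` tuples. The sketch below illustrates this for a single sentence, assuming each chunk appears at most once per sentence; `count_matches` is a hypothetical name, and the chunks are those of the extra `age` sentence introduced earlier.

```python
def count_matches(true_chunks, pred_chunks):
    """Count TP, FN and FP for one sentence by exact (type, start, end) match."""
    true_set, pred_set = set(true_chunks), set(pred_chunks)
    tp = len(true_set & pred_set)  # predicted chunks that match a true chunk exactly
    fn = len(true_set - pred_set)  # true chunks that were not predicted exactly
    fp = len(pred_set - true_set)  # predicted chunks with no exact true counterpart
    return tp, fn, fp


true_chunks = [("age", 0, 10)]
pred_chunks = [("age", 2, 2), ("age", 4, 10)]
print(count_matches(true_chunks, pred_chunks))  # (0, 1, 2)
```

A partially overlapping prediction therefore costs both a false negative and one false positive per predicted fragment.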

In [6]:
print('Results for the first test')
TP_1_len, FN_1_len, FP_1_len, true_entities_1_len, pred_entities_1_len = compute_cm(y_true_1, y_pred_1)
TP_1_str, FN_1_str, FP_1_str, true_entities_1_str, pred_entities_1_str = compute_cm(y_true_1, y_pred_1, suffix=True)

Results for the first test
Mode: None
True: [('age', 0, 2)] - Predicted: []
True: [('age', 0, 5)] - Predicted: [('age', 0, 5)]
True: [('age', 0, 7)] - Predicted: [('age', 0, 7)]
True: [('age', 0, 2)] - Predicted: []
True: [('age', 0, 4)] - Predicted: []
True: [('age', 0, 5)] - Predicted: [('age', 0, 5)]
True: [('age', 0, 2)] - Predicted: [('age', 0, 2)]
True: [('age', 0, 3)] - Predicted: [('age', 0, 3)]
True: [('age', 0, 8)] - Predicted: [('age', 0, 8)]
True: [('age', 0, 7)] - Predicted: [('age', 0, 7)]
True: [('age', 0, 6)] - Predicted: [('age', 0, 6)]
True: [('age', 0, 2)] - Predicted: [('age', 0, 2)]
True: [('age', 0, 2)] - Predicted: [('age', 0, 2)]
True positives: 10 - False negatives: 3 - False positives: 0
Results are the same with cm and seqeval
Mode: strict
True: [('B', 0, 0), ('I', 1, 2)] - Predicted: []
True: [('B', 0, 0), ('I', 1, 5)] - Predicted: [('B', 0, 0), ('I', 1, 5)]
True: [('B', 0, 0), ('I', 1, 7)] - Predicted: [('B', 0, 0), ('I', 1, 7)]
True: [('B', 0, 0), ('I', 1,



In [7]:
print('Results for the second test')
TP_2_len, FN_2_len, FP_2_len, true_entities_2_len, pred_entities_2_len = compute_cm(y_true_2, y_pred_2)
TP_2_str, FN_2_str, FP_2_str, true_entities_2_str, pred_entities_2_str = compute_cm(y_true_2, y_pred_2, suffix=True)


Results for the second test
Mode: None
True: [('age', 0, 2)] - Predicted: []
True: [('age', 0, 5)] - Predicted: [('age', 0, 5)]
True: [('age', 0, 7)] - Predicted: [('age', 0, 7)]
True: [('age', 0, 2)] - Predicted: []
True: [('age', 0, 4)] - Predicted: []
True: [('age', 0, 5)] - Predicted: [('age', 0, 5)]
True: [('age', 0, 2)] - Predicted: [('age', 0, 2)]
True: [('age', 0, 3)] - Predicted: [('age', 0, 3)]
True: [('age', 0, 8)] - Predicted: [('age', 0, 8)]
True: [('age', 0, 7)] - Predicted: [('age', 0, 7)]
True: [('age', 0, 6)] - Predicted: [('age', 0, 6)]
True: [('age', 0, 2)] - Predicted: [('age', 0, 2)]
True: [('age', 0, 2)] - Predicted: [('age', 0, 2)]
True: [('age', 0, 10)] - Predicted: [('age', 2, 2), ('age', 4, 10)]
True positives: 10 - False negatives: 4 - False positives: 2
Results are the same with cm and seqeval
Mode: strict
True: [('B', 0, 0), ('I', 1, 2)] - Predicted: []
True: [('B', 0, 0), ('I', 1, 5)] - Predicted: [('B', 0, 0), ('I', 1, 5)]
True: [('B', 0, 0), ('I', 1, 7)]

At this point we have a way to manually compute the report outputs in both modes. Now, how do we identify the reason for the discrepancy? Let us look at the case with only the last added prediction, where the problem starts.

In [8]:
print('Results for the added predictions')
TP_3_len, FN_3_len, FP_3_len, true_entities_3_len, pred_entities_3_len = compute_cm(added_true, added_pred)
TP_3_str, FN_3_str, FP_3_str, true_entities_3_str, pred_entities_3_str = compute_cm(added_true, added_pred, suffix=True)

Results for the added predictions
Mode: None
True: [('age', 0, 10)] - Predicted: [('age', 2, 2), ('age', 4, 10)]
True positives: 0 - False negatives: 1 - False positives: 2
Results are the same with cm and seqeval
Mode: strict
True: [('B', 0, 0), ('I', 1, 10)] - Predicted: [('I', 2, 2), ('B', 4, 4), ('I', 5, 10)]
True positives: 0 - False negatives: 2 - False positives: 2
Results are the same with cm and seqeval


Notice that the numbers of False Negatives and False Positives differ between the two modes. Adding these counts to those obtained for the first test gives, in each mode, exactly the values reported for the second test.

In [9]:
assert TP_1_len + TP_3_len == TP_2_len
assert FN_1_len + FN_3_len == FN_2_len
assert FP_1_len + FP_3_len == FP_2_len
assert TP_1_str + TP_3_str == TP_2_str
assert FN_1_str + FN_3_str == FN_2_str
assert FP_1_str + FP_3_str == FP_2_str

We can conclude that the difference is caused by the number of TP, FN and FP. Each mode computes them differently, which can make the precision or recall in strict mode higher than in default mode.
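
One way to see where the counts diverge is to extract chunks under a simplified reading of the IOB2 rule, in which a chunk may only open with a `B-` tag. The sketch below is an assumption-laden illustration rather than seqeval's implementation; `strict_iob2_chunks` is a hypothetical name, and the default-mode chunks are copied from the output above.

```python
def strict_iob2_chunks(tags):
    """Return (type, start, end) chunks, accepting only chunks that open with a B- tag."""
    chunks = []
    current_type, start = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the extra "O" closes the last chunk
        prefix, _, type_ = tag.partition("-")
        if prefix == "I" and current_type is not None and type_ == current_type:
            continue  # still inside the chunk opened by the last B- tag
        if current_type is not None:
            chunks.append((current_type, start, i - 1))
            current_type = None
        if prefix == "B":
            current_type, start = type_, i
    return chunks


added_pred = ["O", "O", "I-age", "O", "B-age"] + ["I-age"] * 6
strict_pred = strict_iob2_chunks(added_pred)
lenient_pred = [("age", 2, 2), ("age", 4, 10)]  # default-mode chunks from the output above
print(strict_pred)  # [('age', 4, 10)]

tp = 10  # exact matches from the first test, identical in both modes
for name, pred in (("default", lenient_pred), ("strict", strict_pred)):
    fp = len(pred)  # no predicted chunk matches the true span (0, 10)
    print(name, round(tp / (tp + fp), 2))
# default 0.83
# strict 0.91
```

Under this reading, the isolated `I-age` at position 2 is a false positive in default mode but is not an entity at all in strict mode, which would explain the precision of 0.91 against 0.83 while recall stays at 0.71.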

--------------------------------------------------------------------------------------------------------------------------------------

### The case of the Eligibility entity

This case is used to verify the conclusions in a more difficult scenario where a new entity is included.

In [10]:
y_pred_1 = [['O', 'O', 'O'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age']]
y_true_1 = [['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age'], ['B-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age', 'I-age']]

added_pred = [['I-eligibility', 'I-age', 'I-age', 'I-age']]
added_true = [['B-age', 'I-age', 'I-age', 'I-age']]

y_pred_2 = y_pred_1 + added_pred
y_true_2 = y_true_1 + added_true

In [11]:
print('Results for the first test')
TP_1_len, FN_1_len, FP_1_len, true_entities_1_len, pred_entities_1_len = compute_cm(y_true_1, y_pred_1)
TP_1_str, FN_1_str, FP_1_str, true_entities_1_str, pred_entities_1_str = compute_cm(y_true_1, y_pred_1, suffix=True)

Results for the first test
Mode: None
True: [('age', 0, 2)] - Predicted: []
True: [('age', 0, 5)] - Predicted: [('age', 0, 5)]
True: [('age', 0, 7)] - Predicted: [('age', 0, 7)]
True: [('age', 0, 2)] - Predicted: []
True: [('age', 0, 4)] - Predicted: []
True: [('age', 0, 5)] - Predicted: [('age', 0, 5)]
True: [('age', 0, 2)] - Predicted: [('age', 0, 2)]
True: [('age', 0, 3)] - Predicted: [('age', 0, 3)]
True: [('age', 0, 8)] - Predicted: [('age', 0, 8)]
True: [('age', 0, 7)] - Predicted: [('age', 0, 7)]
True positives: 7 - False negatives: 3 - False positives: 0
Results are the same with cm and seqeval
Mode: strict
True: [('B', 0, 0), ('I', 1, 2)] - Predicted: []
True: [('B', 0, 0), ('I', 1, 5)] - Predicted: [('B', 0, 0), ('I', 1, 5)]
True: [('B', 0, 0), ('I', 1, 7)] - Predicted: [('B', 0, 0), ('I', 1, 7)]
True: [('B', 0, 0), ('I', 1, 2)] - Predicted: []
True: [('B', 0, 0), ('I', 1, 4)] - Predicted: []
True: [('B', 0, 0), ('I', 1, 5)] - Predicted: [('B', 0, 0), ('I', 1, 5)]
True: [('B'

In [12]:
print('Results for the second test')
TP_2_len, FN_2_len, FP_2_len, true_entities_2_len, pred_entities_2_len = compute_cm(y_true_2, y_pred_2)
TP_2_str, FN_2_str, FP_2_str, true_entities_2_str, pred_entities_2_str = compute_cm(y_true_2, y_pred_2, suffix=True)

Results for the second test
Mode: None
True: [('age', 0, 2)] - Predicted: []
True: [('age', 0, 5)] - Predicted: [('age', 0, 5)]
True: [('age', 0, 7)] - Predicted: [('age', 0, 7)]
True: [('age', 0, 2)] - Predicted: []
True: [('age', 0, 4)] - Predicted: []
True: [('age', 0, 5)] - Predicted: [('age', 0, 5)]
True: [('age', 0, 2)] - Predicted: [('age', 0, 2)]
True: [('age', 0, 3)] - Predicted: [('age', 0, 3)]
True: [('age', 0, 8)] - Predicted: [('age', 0, 8)]
True: [('age', 0, 7)] - Predicted: [('age', 0, 7)]
True: [('age', 0, 3)] - Predicted: [('eligibility', 0, 0), ('age', 1, 3)]
True positives: 7 - False negatives: 4 - False positives: 1
Results are the same with cm and seqeval
[('eligibility', 0, 0), ('age', 1, 3)]
[('age', 1, 3)]
Mode: strict
True: [('B', 0, 0), ('I', 1, 2)] - Predicted: []
True: [('B', 0, 0), ('I', 1, 5)] - Predicted: [('B', 0, 0), ('I', 1, 5)]
True: [('B', 0, 0), ('I', 1, 7)] - Predicted: [('B', 0, 0), ('I', 1, 7)]
True: [('B', 0, 0), ('I', 1, 2)] - Predicted: []
Tru

  _warn_prf(average, modifier, msg_start, len(result))


In [13]:
print('Results for the added predictions')
TP_3_len, FN_3_len, FP_3_len, true_entities_3_len, pred_entities_3_len = compute_cm(added_true, added_pred)
TP_3_str, FN_3_str, FP_3_str, true_entities_3_str, pred_entities_3_str = compute_cm(added_true, added_pred, suffix=True)

Results for the added predictions
Mode: None
True: [('age', 0, 3)] - Predicted: [('eligibility', 0, 0), ('age', 1, 3)]
True positives: 0 - False negatives: 1 - False positives: 1
Results are the same with cm and seqeval
[('eligibility', 0, 0), ('age', 1, 3)]
[('age', 1, 3)]
Mode: strict
True: [('B', 0, 0), ('I', 1, 3)] - Predicted: [('I', 0, 3)]
True positives: 0 - False negatives: 2 - False positives: 0
Results are the same with cm and seqeval


  _warn_prf(average, modifier, msg_start, len(result))


In [14]:
assert TP_1_len + TP_3_len == TP_2_len
assert FN_1_len + FN_3_len == FN_2_len
assert FP_1_len + FP_3_len == FP_2_len
assert TP_1_str + TP_3_str == TP_2_str
assert FN_1_str + FN_3_str == FN_2_str
assert FP_1_str + FP_3_str == FP_2_str

As happened before, the counts differ between the two modes once `eligibility` appears. Let us see how this affects the reports.

In [15]:
print(classification_report(y_true_2, y_pred_2)) 
print(classification_report(y_true_2, y_pred_2, mode='strict', scheme=IOB2))

              precision    recall  f1-score   support

         age       0.88      0.64      0.74        11
 eligibility       0.00      0.00      0.00         0

   micro avg       0.78      0.64      0.70        11
   macro avg       0.44      0.32      0.37        11
weighted avg       0.88      0.64      0.74        11

              precision    recall  f1-score   support

         age       1.00      0.64      0.78        11

   micro avg       1.00      0.64      0.78        11
   macro avg       1.00      0.64      0.78        11
weighted avg       1.00      0.64      0.78        11



Once again, strict mode reports higher values because of the difference in TP, FN and FP. Additionally, `eligibility` appears in default mode with support 0 because it occurs in the predicted labels (even without a `B-eligibility` tag) but not in the true labels. It is not counted in strict mode because it does not appear in the true labels.
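
The averaged rows of the default-mode report can be reproduced by hand from the per-class counts. The sketch below does this for precision only; the `age` counts come from the output above, while the `eligibility` counts (one false positive, nothing else) are inferred from the single predicted `('eligibility', 0, 0)` chunk.

```python
counts = {
    "age": {"tp": 7, "fp": 1, "fn": 4},
    "eligibility": {"tp": 0, "fp": 1, "fn": 0},  # inferred: predicted once, never in the true labels
}


def precision(tp, fp):
    """Precision with the 0.0 convention when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


per_class = {name: precision(c["tp"], c["fp"]) for name, c in counts.items()}
support = {name: c["tp"] + c["fn"] for name, c in counts.items()}

# Micro: pool the counts of all classes, then compute one ratio
micro = precision(sum(c["tp"] for c in counts.values()), sum(c["fp"] for c in counts.values()))
# Macro: unweighted mean of the per-class values, so the zero-support class halves the score
macro = sum(per_class.values()) / len(per_class)
# Weighted: mean of the per-class values weighted by support, so eligibility has no effect
weighted = sum(per_class[n] * support[n] for n in counts) / sum(support.values())

print(f"{micro:.4f} {macro:.4f} {weighted:.4f}")  # 0.7778 0.4375 0.8750
```

These values round to the 0.78, 0.44 and 0.88 shown in the precision column, which suggests that the low macro average comes entirely from the spurious `eligibility` class rather than from worse `age` predictions.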

--------------------------------------------------------------------------------------------------------------------------------------

### Easier examples

- Default mode:

In [17]:
y_true = [['B-age', 'I-age']]
y_pred = [['I-age', 'I-age']]

print(classification_report(y_true, y_pred))
TP_1_len, FN_1_len, FP_1_len, true_entities_1_len, pred_entities_1_len = compute_cm(y_true, y_pred)

              precision    recall  f1-score   support

         age       1.00      1.00      1.00         1

   micro avg       1.00      1.00      1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1

Mode: None
True: [('age', 0, 1)] - Predicted: [('age', 0, 1)]
True positives: 1 - False negatives: 0 - False positives: 0
Results are the same with cm and seqeval


In [18]:
y_true = [['I-age', 'I-age']]
y_pred = [['B-age', 'I-age']]

print(classification_report(y_true, y_pred))
TP_1_len, FN_1_len, FP_1_len, true_entities_1_len, pred_entities_1_len = compute_cm(y_true, y_pred)

              precision    recall  f1-score   support

         age       1.00      1.00      1.00         1

   micro avg       1.00      1.00      1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1

Mode: None
True: [('age', 0, 1)] - Predicted: [('age', 0, 1)]
True positives: 1 - False negatives: 0 - False positives: 0
Results are the same with cm and seqeval


- Strict mode:

In [26]:
y_true = [['B-age', 'I-age']]
y_pred = [['I-age', 'I-age']]

print(classification_report(y_true, y_pred, mode='strict', scheme=IOB2))
TP_1_len, FN_1_len, FP_1_len, true_entities_1_len, pred_entities_1_len = compute_cm(y_true, y_pred, suffix=True)

              precision    recall  f1-score   support

         age       0.00      0.00      0.00         1

   micro avg       0.00      0.00      0.00         1
   macro avg       0.00      0.00      0.00         1
weighted avg       0.00      0.00      0.00         1

Mode: strict
True: [('B', 0, 0), ('I', 1, 1)] - Predicted: [('I', 0, 1)]
True positives: 0 - False negatives: 2 - False positives: 0
Results are the same with cm and seqeval


In [20]:
y_true = [['I-age', 'I-age']]
y_pred = [['B-age', 'I-age']]

print(classification_report(y_true, y_pred, mode='strict', scheme=IOB2))
TP_1_len, FN_1_len, FP_1_len, true_entities_1_len, pred_entities_1_len = compute_cm(y_true, y_pred, suffix=True)

              precision    recall  f1-score   support

         age       0.00      0.00      0.00         0

   micro avg       0.00      0.00      0.00         0
   macro avg       0.00      0.00      0.00         0
weighted avg       0.00      0.00      0.00         0

Mode: strict
True: [('I', 0, 1)] - Predicted: [('B', 0, 0), ('I', 1, 1)]
True positives: 0 - False negatives: 1 - False positives: 2
Results are the same with cm and seqeval


  _warn_prf(average, modifier, msg_start, len(result))


In [21]:
y_true = [['I-age', 'I-age']]
y_pred = [['I-age', 'I-age']]

print(classification_report(y_true, y_pred, mode='strict', scheme=IOB2))
TP_1_len, FN_1_len, FP_1_len, true_entities_1_len, pred_entities_1_len = compute_cm(y_true, y_pred, suffix=True)

ValueError: max() arg is an empty sequence

In this last example, since the IOB2 scheme only considers as entities those that start with the `B-` prefix, the `classification_report` function does not recognize any entity. Strict mode does not work well when the labels follow another scheme (IO in this example).
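
If the gold labels are known to be in IO format, one possible workaround is to normalise them to IOB2 before calling the report in strict mode. The sketch below is a hypothetical helper (`to_iob2` is not part of seqeval) and assumes tags of the form `B-type`, `I-type` or `O`. Whether such a conversion is appropriate depends on the data: applied to predictions, it would hide exactly the prefix errors that strict mode is meant to penalise.

```python
def to_iob2(tags):
    """Rewrite a tag sequence so that every chunk opens with a B- tag (IOB2)."""
    fixed = []
    prev_type = None
    for tag in tags:
        prefix, _, type_ = tag.partition("-")
        if prefix == "O" or not type_:
            fixed.append(tag)
            prev_type = None
            continue
        if prefix == "I" and type_ != prev_type:
            prefix = "B"  # I- tag with no open chunk of the same type: promote it to B-
        fixed.append(f"{prefix}-{type_}")
        prev_type = type_
    return fixed


print(to_iob2(["I-age", "I-age"]))  # ['B-age', 'I-age']
```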

--------------------------------------------------------------------------------------------------------------------------------------

### Conclusion

Both modes compute the metrics in a strict way, in the sense that an entity only counts as correct when its span matches exactly. On the one hand, default mode treats the entities as if they were in IO format, i.e., it does not take the prefix into account when computing the metrics. On the other hand, since strict mode works with the IOB2 format, entities must start with a `B-entity-name` tag, and for a prediction to be correct the whole entity has to match in both entity name and prefix.