Bias in TF-IDF + SVM results for SCOTUS #32

Closed
danigoju opened this issue Nov 3, 2022 · 7 comments
@danigoju
Contributor

danigoju commented Nov 3, 2022

You have probably realised that the results of the TF-IDF + SVM approach for SCOTUS are quite high; I think they are biased. The testing metrics appear to be computed after retraining the Pipeline on the training and validation sets combined, while the other language models are only fine-tuned on the training set. This happens because sklearn.model_selection.GridSearchCV has the parameter "refit" set to True by default, which results in a biased comparison.

Training on only the training set, with the best hyper-parameters found on the validation set, the micro-F1 score is closer to 74.0 and the macro-F1 to 64.4.

Reference:

gs_clf = GridSearchCV(text_clf, parameters, cv=val_split, n_jobs=32, verbose=4)
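
For illustration, a minimal sketch of the corrected evaluation, assuming the same text_clf, parameters and predefined validation split (val_split) as in the original script; the data variables (train_val_texts, train_val_labels, train_texts, train_labels, test_texts) are hypothetical placeholders:

from sklearn.model_selection import GridSearchCV

# Search over the concatenated train+validation data (as in the original
# script), but do NOT let GridSearchCV re-fit on the combined data afterwards.
gs_clf = GridSearchCV(text_clf, parameters, cv=val_split, n_jobs=32, verbose=4, refit=False)
gs_clf.fit(train_val_texts, train_val_labels)

# Re-fit on the training split only, using the hyper-parameters that performed
# best on the validation split, then evaluate on the test set.
best_clf = text_clf.set_params(**gs_clf.best_params_)
best_clf.fit(train_texts, train_labels)
test_preds = best_clf.predict(test_texts)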

danigoju added a commit to danigoju/lex-glue that referenced this issue Nov 3, 2022
See issue coastalcph#32 for more info about the bias
@iliaschalkidis
Collaborator

iliaschalkidis commented Nov 3, 2022

Hi @danigoju, that's a great 🐛 finding! In our paper, we comment on how well TF-IDF + SVM performs on SCOTUS, but we could only speculate that this is because of its capability to encode longer documents (SCOTUS is by far the task with the longest documents).

I guess this also affects the results for the rest of the tasks? Why wouldn't it?

@danigoju
Contributor Author

danigoju commented Nov 4, 2022

Yes, this bug probably overestimates the TF-IDF+SVM testing scores for all the datasets, as it is using a larger proportion of data. I only mentioned SCOTUS because it was the dataset I was working with and the tunnel vision caught me 😅.

@iliaschalkidis
Collaborator

Sure, it's the most "extreme" case, so I understand... Cool, I will re-run all of them and update the paper then. Our faith in deep learning can be restored 🤣 Thanks again!

@iliaschalkidis
Collaborator

Cool, I re-ran the experiments and updated the README.md table with the new scores for TFIDF-SVM across all tasks.

@JamesLYC88

Hi @danigoju and @iliaschalkidis,

I disagree with the statement that "this bug probably overestimates the TF-IDF+SVM testing scores for all the datasets, as it is using a larger proportion of data" because of the following points.

First, BERT and its variants also use the validation data!
Specifically, BERT-based methods need validation data to determine the stopping condition of the training process.
Further, if one conducts a hyper-parameter tuning, it is necessary to have the validation data for selecting the best hyper-parameters.
Therefore, TF-IDF+SVM methods do not use a "larger" proportion of data, because both SVM and BERT-based models use the validation data in their training processes.

Second, re-training the final model using the whole set (training+validation) is a common practice in TF-IDF+SVM methods.
The procedure of training an SVM model is introduced in

  1. Section 1.2 from https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
  2. the flow chart from https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

In particular, K-fold cross-validation is conducted for each hyper-parameter configuration to find the best hyper-parameters.
Consequently, under the best hyper-parameters, K models are generated in the cross-validation process.
Even if they have been stored, it is unclear which one of the K models should be considered as the final model.
Therefore, it is a reasonable setting to re-train the final model using the best hyper-parameters and the whole set.
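
To make this concrete, here is a minimal sketch of that standard workflow; the data variables (all_texts, all_labels) are hypothetical placeholders, and the pipeline and grid are illustrative rather than the exact LexGLUE configuration:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
param_grid = {"svm__C": [0.1, 1, 10]}

# K-fold CV selects the best hyper-parameters; refit=True (the default) then
# re-trains a single final model on everything passed to fit(), which resolves
# the "which of the K models is final?" ambiguity mentioned above.
search = GridSearchCV(pipe, param_grid, cv=5, refit=True)
search.fit(all_texts, all_labels)  # training + validation combined
final_model = search.best_estimator_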

Third, in deep learning practices, re-training the final model using the validation set can also be conducted.
Specifically, BERT-based methods in the LexGLUE scripts only fit the model on the training set, while the validation set is not "learned" by the model.
We thus illustrate a possible procedure to perform re-training in deep learning (a code sketch follows the list).

  1. Conduct a hyper-parameter tuning by training the model on the training split under each hyper-parameter configuration and selecting the best hyper-parameters using the validation data.
  2. Record the number of training steps that leads to the best validation performance during the hyper-parameter search as e*.
  3. Re-train the final model using the best hyper-parameters and the whole set (training+validation) for e* epochs.
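
As an illustration only, a minimal PyTorch-style sketch of step 3; build_model, best_hparams, train_plus_val_loader and e_star are hypothetical placeholders, not part of the LexGLUE scripts:

import torch

def retrain_final_model(build_model, best_hparams, train_plus_val_loader, e_star, device="cuda"):
    # Step 3: re-train from scratch on training+validation for exactly e_star
    # optimisation steps, using the hyper-parameters selected in steps 1-2.
    # The model is assumed to return an object with a .loss attribute
    # (HuggingFace-style); adapt as needed.
    model = build_model(**best_hparams).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=best_hparams["lr"])
    model.train()
    step = 0
    while step < e_star:
        for batch in train_plus_val_loader:
            if step >= e_star:
                break
            optimizer.zero_grad()
            loss = model(**{k: v.to(device) for k, v in batch.items()}).loss
            loss.backward()
            optimizer.step()
            step += 1
    return model  # evaluate on the test set only; early stopping is no longer possible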

Based on this procedure, the results of evaluating BERT on SCOTUS are as follows.

  • no re-train results (evaluated by the best model after the hyper-parameter search in step 2)
    • Micro-F1: 67.1, Macro-F1: 55.9
  • with re-train results (evaluated by the final model in step 3)
    • Micro-F1: 71.4, Macro-F1: 61.9

The results show a clear improvement from conducting the re-training process in step 3.
Therefore, it is viable and sometimes beneficial to re-train the final model using the validation set in deep learning practices.

In conclusion, we argue that BERT-based models in LexGLUE also incorporate the validation data in the training process, so it is fair for TF-IDF+SVM and BERT-based methods to access both the training and validation set.
Moreover, practitioners can decide how they use the validation data!
For example, in training SVMs, re-training the final model using the validation set has always been a common process for people to follow.
As for training BERT or other deep learning models, people can use their validation data only to select the best model and control the stopping condition of the training process.
Or, they can perform re-training by either following our proposed procedure or designing other algorithms.

@iliaschalkidis
Collaborator

Hi @JamesLYC88,

There is no argument about whether Transformer-based models should use the validation subset while TF-IDF+SVM models shouldn't. The argument is that they should both leverage the validation set in a similar fashion. In the original LexGLUE experiments (Chalkidis et al., 2022), we aimed to use the validation subset to tune hyper-parameters (step 2 in your described workflow), and then consider the models with the best validation performance. In reality, we did that for Transformer-based models, but not for TF-IDF+SVM models, since we also unconsciously re-trained the latter (step 3 in your described workflow). The fix from @danigoju leads to the same conditions for both, i.e., training on a fixed training set, tuning on a fixed validation set, and evaluation on the validation and test sets.

I do not claim -and I guess neither does @danigoju- that re-training is an unfair practice. I claim that it is unfair for some models to do so and for others not to -since training with more samples will most likely lead to better generalization and performance improvements, even with non-ideal hyper-parameters- when we compare models in a controlled setting.

I agree that it would be interesting to see if this holds across all models and datasets, by re-training the Transformer-based models using both the training and validation subsets. This is something I currently have no resources to do, but I plan to do it in the future with the release of new Transformer-based models and an extended version of LexGLUE, including more tasks:

  • ContractNLI (Koreeda and Manning, 2021),
  • ECtHR Arguments (Habernal et al., 2022),
  • UK-LEX (Chalkidis and Søgaard, 2022),
  • ILCD (Malik et al., 2021).

In the meantime, research on improved training settings is welcome, e.g., data augmentation or other fair practices that aim to improve performance. Data efficiency is also an interesting topic, since performance varies across datasets with large differences in the number of samples, so understanding these dynamics would be valuable.

@JamesLYC88

Hi @iliaschalkidis,

Thanks for the reply. It's great to hear you plan to perform the re-training experiments on BERT-based models. For your information, we have already conducted the re-training experiments on LexGLUE and observed a stable improvement in most datasets. Please refer to pages 27-28 from my advisor's presentation slides (https://www.csie.ntu.edu.tw/~cjlin/talks/bloomberg.pdf) for more details.
