
Fine Tuning Models #350

Open
farhaanbukhsh opened this issue Aug 11, 2020 · 34 comments

@farhaanbukhsh

farhaanbukhsh commented Aug 11, 2020

We want to fine-tune 'bert-large-nli-stsb-mean-tokens' on a multi-label classification task, so that we can use the resulting model to get embeddings out.
We have a bunch of sentences classified into labels. The prime questions I want clarification on:

  1. What are the steps to fine-tune the models provided by sentence-transformers?
  2. Is there a script or a repo which does this that I can cross-reference here?

The closest we have come to is simpletransformers

But is there a better way to go ahead? Thanks in advance for the help

@farhaanbukhsh
Author

@nreimers can you please help us out here? Any help will be highly appreciated, Thanks a lot 😄

@nreimers
Member

Hi @farhaanbukhsh
I am not sure what you want / what type of data you have?

Do you want multi-label classification for single sentences? Then I can recommend using the transformers package to fine-tune the model. Sentence-Transformers is not needed for that.

You want to fine-tune sentence embeddings with a training set where you only have labels and single sentences, and then use the sentence embeddings, for example with cosine similarity, for tasks like clustering? This is a non-trivial case and sadly there is no single "do this" answer. It heavily depends on what your dataset looks like and how you want to use the sentence embeddings afterwards. You can have a look here at how a dataset with (label, single_sentence) pairs can be used with BatchHardTripletLoss: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_batch_hard_trec_continue_training.py
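To illustrate the idea behind BatchHardTripletLoss, here is a simplified numpy sketch of batch-hard mining; it is purely illustrative and not the library's actual implementation:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Toy batch-hard triplet loss: for each anchor, take the hardest
    positive (farthest same-label example) and the hardest negative
    (closest different-label example) in the batch."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise distances
    same = labels[:, None] == labels[None, :]     # same-label mask
    n = len(labels)
    losses = []
    for i in range(n):
        pos = dist[i][same[i] & (np.arange(n) != i)]  # same label, not self
        neg = dist[i][~same[i]]                       # different label
        if len(pos) and len(neg):
            losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))
```

With well-separated label clusters the loss goes to zero; with all points collapsed onto each other it equals the margin.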

You have sentence pairs and labels? Then you can maybe use the same setup as for NLI:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_nli.py

Best
Nils Reimers

@farhaanbukhsh
Author

You want to fine-tune sentence embeddings with a training set where you only have labels and single sentences, and then use the sentence embeddings for example with cosine similarity for tasks like clustering?

yes, exactly what we are trying to do.


Thanks a ton @nreimers for replying, means a lot ❤️

What I have is a simple csv file where for a given sentence I have assigned labels, for example:

The heater from amazon was damaged. Labels: ['defective product', 'electric appliance']

So for this, do you think https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_batch_hard_trec_continue_training.py can be used?

@nreimers
Member

Yes, for this BatchHardTripletLoss seems a good choice.

For triplet loss, have a look at this good blog article:
https://omoindrot.github.io/triplet-loss
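The basic per-triplet loss described in that article fits in a couple of lines (a minimal sketch, with d_ap and d_an standing for the anchor-positive and anchor-negative distances):

```python
def triplet_loss(d_ap, d_an, margin=1.0):
    """Plain triplet loss: push the anchor-negative distance to be at
    least `margin` larger than the anchor-positive distance."""
    return max(0.0, d_ap - d_an + margin)
```

A triplet that already satisfies the margin contributes zero loss and teaches the model nothing, which is why mining hard triplets matters.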

With BatchHardTripletLoss, sentences (texts) that have the same label will become close in vector space, while sentences with a different label will be further away. At the end you will have clusters in your vector space: e.g. all sentences talking about 'defective product' will be in one region, while sentences with the label 'great price' will be somewhere else in vector space.
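As a toy illustration of using such clustered embeddings afterwards (the 2-d vectors below are made up purely for demonstration):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend fine-tuned embeddings: same-label sentences ended up close together.
embeddings = {
    "The heater from amazon was damaged.": np.array([0.9, 0.1]),  # defective product
    "The blender arrived broken.":         np.array([0.8, 0.2]),  # defective product
    "Great price for this kettle.":        np.array([0.1, 0.9]),  # great price
}

query = np.array([0.85, 0.15])  # embedding of a new "defective" complaint
nearest = max(embeddings, key=lambda s: cosine_sim(query, embeddings[s]))
```

The query lands nearest to the 'defective product' cluster, which is what the fine-tuning aims to achieve.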

Best
Nils Reimers

@farhaanbukhsh
Author


Thanks loads @nreimers 💯

@farhaanbukhsh
Author

Hey @nreimers, we faced a problem: the script

https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_batch_hard_trec_continue_training.py

doesn't work out of the box. The data source, https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label,

has just one sentence per example, while for

evaluator = TripletEvaluator.from_input_examples(val, name='dev')

to work, it needs 3 sentences. I guess there is some confusion between how TripletReader and InputExample are being used.

If you could point me to the right data source to be used for TripletLoss, it would be a lot of help. Thanks a lot in advance.

@nreimers
Member

Hi @farhaanbukhsh
I fixed that example, it should be working now.

The triplets for dev & test have to be created before they are passed to TripletEvaluator.
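A rough sketch of how such dev/test triplets could be built from (sentence, label) pairs; the function name and structure here are illustrative, not taken from the actual script:

```python
import random

def make_random_triplets(examples, num=1000, seed=42):
    """Build (anchor, positive, negative) sentence triplets from
    (sentence, label) pairs: anchor and positive share a label,
    negative has a different one."""
    rng = random.Random(seed)
    by_label = {}
    for sent, label in examples:
        by_label.setdefault(label, []).append(sent)
    # Only labels with at least two sentences can supply anchor+positive.
    pos_labels = [l for l, sents in by_label.items() if len(sents) >= 2]
    triplets = []
    for _ in range(num):
        pos_label = rng.choice(pos_labels)
        neg_label = rng.choice([l for l in by_label if l != pos_label])
        anchor, positive = rng.sample(by_label[pos_label], 2)
        negative = rng.choice(by_label[neg_label])
        triplets.append((anchor, positive, negative))
    return triplets
```

Each resulting triplet can then be wrapped in an InputExample for the evaluator.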

Best
Nils

@farhaanbukhsh
Author


Thanks a ton for all the work here ❤️

@farhaanbukhsh
Author

farhaanbukhsh commented Aug 12, 2020

@nreimers do I need a GPU to run BatchSemiHardTripletLoss.py, or can I run it without one?

Getting this error:

AssertionError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

coming from sentence_transformers/losses/BatchSemiHardTripletLoss.py, line 60, in batch_semi_hard_triplet_loss

@nreimers
Member

Fixed that with the latest push, but you would need to install the framework from source.

Or you can use one of the two other batch-hard losses; they work without a GPU in the latest release.

@farhaanbukhsh
Author

Got it :) thanks again

@farhaanbukhsh
Author

Hey @nreimers ,

I just have a few follow-up questions. I went through the triplet loss blog post that you pointed out, https://omoindrot.github.io/triplet-loss, and it helped a lot. My questions are:

  1. In the script https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_batch_hard_trec_continue_training.py, the triplets are formed randomly. Is this done for demonstration purposes, or can we use this technique off the shelf?

  2. In case we can't, we need to form our own set of triplets, right? Also, is this dataset, https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/wikipedia-sections-triplets.zip, related to triplet loss, and can it be used?

I also went through https://towardsdatascience.com/image-similarity-using-triplet-loss-3744c0f67973, where it is mentioned that we have to curate the triplet dataset and supply it to the tuning algorithm. So is human curation required here?

@nreimers
Member

Hi @farhaanbukhsh
The random triplets are formed only for the dev & test sets; for the train set, this is not used.
Random triplets for the dev & test sets are rather simple to distinguish. Depending on your application, you might need harder triplets to fully evaluate the performance of your model.

For training, the application uses what are called "Batch All / Hard / SemiHard Triplets" (also explained in the link). You generate a mini-batch with n sentences. It then checks, out of the n x n x n possible combinations, which are valid triplets, i.e. the anchor and positive have the same label and the negative has a different label. It then uses these valid triplets to compute the loss.
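The validity check over the n x n x n combinations can be sketched as:

```python
from itertools import product

def valid_triplets(labels):
    """Return all (anchor, positive, negative) index triplets where anchor
    and positive are distinct and share a label, and negative differs."""
    n = len(labels)
    return [(a, p, neg) for a, p, neg in product(range(n), repeat=3)
            if a != p and labels[a] == labels[p] and labels[a] != labels[neg]]

# A batch of 4 sentences with labels [0, 0, 1, 1] yields 8 valid triplets.
triplets = valid_triplets([0, 0, 1, 1])
```

Even a modest batch size produces many triplets this way, which is why hand-curating them is unnecessary.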

The quality (difficulty) of the triplets is quite important to get good results. If the triplets are too easy, the model will not learn anything. With the Batch*TripletLoss approach you already create a large number of triplet combinations, hence designing difficult triplets by hand is not really needed.

Human curation for training is not needed.

Best
Nils Reimers

@ironllamagirl

Hi!
Just wanted to inquire about this file https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_batch_hard_trec_continue_training.py

Has it been moved somewhere else? I can't seem to open the link.

@nreimers
Member

https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec_continue_training.py

@thefirebanks

Hi @nreimers! Thank you for the advice written in this thread. It appears that the file https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec_continue_training.py is giving a 404 error.

@farhaanbukhsh We are working on a similar task as you: e.g. given the sentence "The tomatoes were rotten" and a set of possible labels ["fruit", "vegetable", "root"], I want to assign one label to the sentence. Did you end up using the triplet loss methodology successfully? I was also looking at this thread.

I have 2 questions:

  • We fine-tuned a model using the code from https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli.py, but we found that the resulting model only yields embeddings and does not necessarily classify text. Is there a way of including the linear layer used for classification during fine-tuning in the saved model? That way, when we load the model, we could do something like model.predict() instead of just model.encode().
  • If not, should we load the model embeddings and fine-tune using huggingface?

Thank you in advance!

@nreimers
Member

@thefirebanks
It was renamed: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_batch_hard_trec.py

The purpose of SentenceTransformer is to create a model that can generate meaningful embeddings for text. If your task is classification, then using sentence embeddings is in most cases the wrong approach. In that case, a CrossEncoder works much better: https://www.sbert.net/examples/training/cross-encoder/README.html

Currently there is no implemented way to include that layer in the model. But you can save it with torch.save() and load it with torch.load() and apply it on top of the sentence embeddings.
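A toy sketch of applying a saved linear head on top of sentence embeddings; numpy arrays stand in here for the torch tensors you would actually save and load with torch.save()/torch.load(), and the weights below are made up:

```python
import numpy as np

def classify(embedding, W, b):
    """Apply a linear classification head to a sentence embedding and
    return the index of the highest-scoring class."""
    logits = W @ embedding + b
    return int(np.argmax(logits))

# Made-up 2-class head over 2-dimensional embeddings.
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.zeros(2)

# In practice, `embedding` would come from model.encode(sentence).
pred = classify(np.array([0.2, 0.9]), W, b)
```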

But as mentioned, if your task is classification, then CrossEncoders achieve much better performance.

@noghte

noghte commented Jan 25, 2021


I have read this thread, and your answers were helpful @nreimers. Thank you!
I also have a sentence classification task, in which each row of my dataset is a long text with a label. But the CrossEncoder example you provided requires sentence pairs. I will try to figure it out, but if you have an example, it would be great if you could share the link.

@nreimers
Member

You can also pass a single text to the CrossEncoder class. It will work without changes.

@noghte

noghte commented Feb 1, 2021

You can also pass a single text to the CrossEncoder class. It will work without changes

I could fine-tune a BERT model using CrossEncoder. However, for prediction, it seems to need sentence pairs:

You pass to model.predict a list of sentence pairs. Note, Cross-Encoder do not work on individual sentence, you have to pass sentence pairs.
source: https://www.sbert.net/examples/applications/cross-encoder/README.html#cross-encoders-usage

Any idea how to predict the class of a single text and how to find the model accuracy?

Thanks!

@thefirebanks

thefirebanks commented Feb 4, 2021

@noghte My guess would be to use the class as the second sentence in the pair, and then just look at the predictions (classes) to evaluate the accuracy. Something we did before looking at Cross-Encoders is the latent embedding approach in this blog:

https://joeddav.github.io/blog/2020/05/29/ZSL.html

Our implementation for a policy project is here.

We will try CrossEncoders soon! But we realized the following:

  • Fine-tuned sentence embeddings from S-BERT, given as inputs to more commonly used classifiers such as Random Forests or SVMs, can actually give really good performance on some texts. Hope this helps!

@thefirebanks

@noghte Just wanted to check in: how did you manage to do your classification task in the end with CrossEncoders? Specifically, the label prediction for a single sentence?

@nreimers
Member

@thefirebanks Have a look here:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/cross-encoder/training_nli.py

Instead of passing two sentences, you just pass one sentence.

@thefirebanks

thefirebanks commented Apr 11, 2021

@nreimers Thank you for your quick reply!

At the top of the file you linked, it says:

It does NOT produce a sentence embedding and does NOT work for individual sentences.

When I try giving one sentence, I build the train dataloader in this way:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_samples = []
for sent, label in zip(X_train, y_train):
    label_id = label2int[label]
    train_samples.append(InputExample(texts=[sent], label=label_id))

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)

where sent = "The quick fox did something" and label = 1.

and did something similar for the dev_samples:

dev_samples = []
for sent, label in zip(X_dev, y_dev):
    label_id = label2int[label]
    dev_samples.append(InputExample(texts=[sent], label=label_id))

However, while training the accuracy remains at 0.0 even after 10 epochs.

Am I missing something?

@noghte

noghte commented Apr 11, 2021

@noghte Just wanted to check-in, how did you manage to do your classification task in the end with CrossEncoders? Specifically the label prediction for a single sentence?

@thefirebanks I could not use CrossEncoders; I just ignored fine-tuning for now. However, I am planning to try your latent embedding approach later: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Please let me know if you succeed in using CrossEncoders for your task.

@nreimers
Member

@thefirebanks
Likely there is some issue with how you define the evaluation or with the data you pass to it.

As evaluator, you can use: CESoftmaxAccuracyEvaluator
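In essence, that evaluator reports argmax-over-logits accuracy, which can be sketched as follows (a simplified illustration, not the library's code):

```python
import numpy as np

def softmax_accuracy(logits, gold):
    """Accuracy of argmax class predictions against gold labels; softmax
    does not change the argmax, so it can be skipped for accuracy."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == np.asarray(gold)))

# Toy logits for 3 examples over 3 classes; the last prediction is wrong.
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.1, 0.4, 0.2]])
acc = softmax_accuracy(logits, [0, 1, 2])
```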

@thefirebanks

Thank you, using CESoftmaxAccuracyEvaluator worked!

@noghte I simply replicated the script that @nreimers linked to, and it worked; the only modification was that I needed to replace CEBinaryAccuracyEvaluator with CESoftmaxAccuracyEvaluator.

@Sok-Vichea

Hi @nreimers I want to use sentence-transformers for my multi-label classification problem (you could say document classification with multiple labels). My data frame has the following pattern:

“sentence 1.sentence2…sentenceN” [‘label1’, ‘label2’, ‘label3’, ’label4’]
“sentence 1.sentence2…sentenceN” [‘label1’, ‘label2’, ‘label3’]
.......

The first column is the text, composed of many sentences, and the second column holds the multiple labels (there are many labels in total, e.g. label1, label2, ..., labelN) that each text is assigned. What might be the right way to use sentence-transformers to fine-tune a model for my case? Thank you so much for your reply and your recommendation.

@nreimers
Member

Hi @ing-david
hmm, that is not straightforward, as sentence-transformers was not designed for such a task.

You could try to use the cross-encoder with an MSE loss and an output vector like [0, 1, 1, 0, 0, 1] that represents your multi-labels. Not sure if this is possible with the cross-encoder class.
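Building such a multi-label target vector could be sketched like this (the label names are placeholders):

```python
def encode_multilabel(assigned, all_labels):
    """Encode a document's assigned labels as a fixed-length 0/1 target
    vector over the full label set, e.g. for an MSE-style objective."""
    return [1.0 if label in assigned else 0.0 for label in all_labels]

all_labels = ["label1", "label2", "label3", "label4", "label5", "label6"]
# Produces the [0, 1, 1, 0, 0, 1] pattern mentioned above.
target = encode_multilabel(["label2", "label3", "label6"], all_labels)
```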

Otherwise, you could embed both labels and sentences in a vector space and use a bi-encoder, so that sentences are close in vector space to their assigned labels. But this is also not easy to do with sentence-transformers and would require that you take a deeper look at the respective classes and how to use them for your task.

@BillManka

BillManka commented Jun 25, 2021

Hi @nreimers,

Thanks so much for your work! I'm looking to use the embeddings for clustering, in order to better understand a distribution that will undergo regression. I have continuous labels for ~3000 short passages (<300 words).

It seems that the most straightforward way to fine-tune with a labeled set is to use paired passages with cosine similarity. Do you have an intuition on whether I might be able to bootstrap those pair labels by passing my data through the bi-encoder once, and then use my continuous labels to weight these pair labels, so as not to drive the distribution too far away from the target?

Thanks,
Bill

@BillManka

P.S. My motivation for clustering (unsupervised) is that the labels of my dataset are very noisy, and I'd like to shed some light on the noise.

@nreimers
Member

Hi @BillManka
Not sure what you mean.

What you need are labels for text pairs; labels for single texts alone are not sufficient. Bootstrapping the labels for the text pairs is not really necessary.

@BillManka

Aha. Honestly, I think I was just looking for a quick-and-dirty hack to tune the embeddings to my data a bit more.
Thanks again!

@BillManka

I.e., weight the cosine-similarity label by some scaled version of the distance between the pair members in the supervised space. I have no theory at all to justify it; it would be completely exploratory.
