
Poor evaluation results on a dataset #107

Closed
anatoly-khomenko opened this issue Jan 20, 2020 · 5 comments

@anatoly-khomenko

Hello @nreimers,

Thank you for amazingly simple to use code!

I'm trying to fine-tune the 'bert-base-nli-mean-tokens' model to match user searches to job titles.

My training dataset consists of 934,791 sentence pairs with a score for each pair, so I use the example for fine-tuning on the STS Benchmark (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_stsbenchmark_continue_training.py)

I train using the parameters from the example (4 epochs with batch size 16).
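For reference, my training setup follows the linked example roughly like this (the data loading is simplified here; train_rows is just a placeholder for my actual (search, title, score) records, with scores already scaled to 0-1, and the dev evaluator from the example script is omitted):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, SentencesDataset, losses
    from sentence_transformers.readers import InputExample

    model = SentenceTransformer('bert-base-nli-mean-tokens')

    # train_rows: my own list of (search_phrase, job_title, score) tuples
    train_examples = [InputExample(guid=str(i), texts=[search, title], label=float(score))
                      for i, (search, title, score) in enumerate(train_rows)]

    train_data = SentencesDataset(train_examples, model)
    train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(model)

    # The linked example additionally passes an EmbeddingSimilarityEvaluator on the dev split
    model.fit(train_objectives=[(train_dataloader, train_loss)],
              epochs=4,
              warmup_steps=1000)  # warmup value is arbitrary here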
The evaluation results I'm getting after training are the following:

2020-01-20 15:20:03 - Cosine-Similarity :	Pearson: 0.0460	Spearman: 0.1820
2020-01-20 15:20:03 - Manhattan-Distance:	Pearson: -0.0294	Spearman: 0.0167
2020-01-20 15:20:03 - Euclidean-Distance:	Pearson: -0.0295	Spearman: 0.0169
2020-01-20 15:20:03 - Dot-Product-Similarity:	Pearson: 0.0468	Spearman: 0.1853
0.18530780992075702

I believe this means that the model has not learned useful embeddings.

Here is what my dataset looks like for one search phrase:
[image: sample rows for one search phrase]

The distribution of the score column is:
[image: distribution of the score column]
So I would consider this a balanced dataset.

What would you recommend as the next steps to improve the results?

  1. Continue training until the similarity metrics reach 0.85, as in the STS example?
  2. Modify the model by adding a layer for encoding the search input (as you recommend here: Is it proper for sentence-phrase pair or sentence-word pair? #96 (comment))?

Any other advice would be helpful.

Thank you!

@nreimers
Member

Hi @anatoly-khomenko
Some notes:

  1. The STS dataset has scores between 0 - 5. Hence, the STS reader normalizes the scores by dividing them by 5 so that you get scores between 0 - 1. If you haven't disabled it (you can pass False as a parameter), your scores would be normalized to the range 0 - 0.1 (I think 0.5 is your highest score?). See the reader sketch at the end of this comment.

  2. You have an asymmetric use case: it makes a difference which text is the query and which text is the response, i.e., swapping the two would make a difference in your case. The models here are optimized for the symmetric use case, i.e., sim(A, B) = sim(B, A).

For your task, using an asymmetric structure could be helpful. You add one (or more) dense layers to one part of the network. So for example
A -> BERT -> Mean-Pooling -> Output
B -> BERT -> Mean-Pooling -> Dense -> Output

Even if A and B are identical, B would get a different sentence embedding because one is the query and the other is the document.

  3. Your search inputs appear rather short. Contextualized word embeddings (like ELMo and BERT) show some issues if you have only single terms or when you match for single terms. The issue is the following:
    Document: My cat is black
    Search query: cat

Here, 'cat' in the document and 'cat' in the search query would get different vector representations, making it more challenging to match them. Non-contextualized word embeddings like GloVe are easier to use in this case, as 'cat' is always mapped to the same point in vector space.

  4. Your score distribution looks quite skewed. Mathematically, cosine similarity creates a vector space where the similarities are more or less equally distributed. This is especially true if you have a symmetric network structure. With an asymmetric structure, you can counter this issue a little bit. But in general I could imagine that modeling a skewed score distribution with cosine similarity is quite hard.
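To illustrate point 1: the STS-style reader can be configured so that the normalization matches your own score range (the folder and column indices below are placeholders for your files):

    from sentence_transformers.readers import STSDataReader

    # max_score should match the largest score in your data; if your scores are already
    # in 0 - 1 you can instead pass normalize_scores=False to disable the division.
    reader = STSDataReader('path/to/your/data',
                           s1_col_idx=0, s2_col_idx=1, score_col_idx=2,
                           normalize_scores=True, min_score=0, max_score=1)
    train_examples = reader.get_examples('train.tsv')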

@anatoly-khomenko
Author

Hello @nreimers ,

Thank you for the prompt answer!

I have replaced STSDataReader with my own implementation, so the score was between 0 and 1.

Could you point me to an example of modifying the model as you recommend? Shall I create a separate model as in https://github.com/UKPLab/sentence-transformers/tree/master/sentence_transformers/models and make SentenceTransformer use it?

Thank you!

@nreimers
Member

Hi @anatoly-khomenko
I'm afraid that creating an asymmetric structure is not straightforward, as the architecture was designed more for symmetric network structures.

What you can do is create a new layer derived from the models.Dense module (let's call it AsymmetricDense). Your architecture will look like this:
Input -> BERT -> mean pooling -> AsymmetricDense

In AsymmetricDense's forward method, you add a special routine that depends on a flag in your input features:

def forward(self, features):
    if features['input_type'] == 'document':
        features.update({'sentence_embedding': self.activation_function(self.linear(features['sentence_embedding']))})
    return features

Then you need a special reader: for your queries, you set features['input_type'] to 'query'; for your documents (your titles), you set features['input_type'] to 'document'.

The dense layer will then only be applied to input texts with input_type == 'document'.
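A rough sketch of how this could be put together, assuming models.Dense keeps its self.linear and self.activation_function attributes and the BERT/Pooling modules from the current version. The custom reader and the batching code that carry the 'input_type' flag through to the features dict are not shown; that is the part that needs the most custom work:

    from sentence_transformers import SentenceTransformer, models

    class AsymmetricDense(models.Dense):
        # Apply the dense transformation only to inputs flagged as documents
        def forward(self, features):
            if features['input_type'] == 'document':
                features.update({'sentence_embedding':
                                 self.activation_function(self.linear(features['sentence_embedding']))})
            return features

    # BERT -> mean pooling -> AsymmetricDense
    word_embedding_model = models.BERT('bert-base-uncased')
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    dim = word_embedding_model.get_word_embedding_dimension()
    asym_dense = AsymmetricDense(in_features=dim, out_features=dim)

    model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_dense])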

@anatoly-khomenko
Author

Hello @nreimers ,

Thank you for the detailed comment! I will try to implement that and see what happens.

On a side note, I have filtered my dataset so that searches are longer than 20 characters and also used another field as the score. This field is either 0 or 1 in most cases.
This was done in the hope that, for the same search phrase, it would push positive and negative examples as far apart in vector space as possible.
Though by mistake I did not remove around 1% of items that have values greater than 1.
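For completeness, the filtering looks roughly like this (column names like 'search_phrase' and 'label' and the file path are placeholders for the real ones; the last line is the clean-up step I forgot to apply):

    import pandas as pd

    df = pd.read_csv('dataset.tsv', sep='\t')  # placeholder path

    # Keep only search phrases longer than 20 characters
    df = df[df['search_phrase'].str.len() > 20]

    # Use the (mostly) binary field as the training score
    df['score'] = df['label'].astype(float)

    # The step I missed: drop the ~1% of rows with score > 1
    df = df[df['score'] <= 1.0]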

After training for 16 epochs with batch size 64 on the filtered dataset, I am still getting low correlations on the test set, and their value began to decrease over time:

[image: test-set correlations over the course of training]

Distribution of score, in this case, is around 60/30:

[image: distribution of the score on the filtered dataset]

Probably even with the asymmetric model I would not get good embeddings, due to some other properties of the dataset that I do not yet understand.

What are the other important properties of the dataset (except for symmetry) that might make the model perform poorly?

Thank you!

@nreimers
Member

Hi @anatoly-khomenko
If your gold labels are only 0/1, then computing Spearman correlation does not make much sense. An accuracy measure with, e.g., a threshold would make more sense: all scores above 0.5 are mapped to label 1, all others to label 0. Then you can compute precision / recall / accuracy.
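A minimal sketch of such a threshold-based evaluation, assuming you already have the predicted cosine similarities and the 0/1 gold labels as arrays (scikit-learn is only used for the metric functions):

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # cosine_scores: predicted similarities, gold_labels: 0/1 annotations (both placeholders)
    predicted_labels = (np.asarray(cosine_scores) > 0.5).astype(int)

    print("Accuracy: ", accuracy_score(gold_labels, predicted_labels))
    print("Precision:", precision_score(gold_labels, predicted_labels))
    print("Recall:   ", recall_score(gold_labels, predicted_labels))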
