This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

About efficiency of the model #80

Open
AOZMH opened this issue Apr 11, 2021 · 21 comments

Comments


AOZMH commented Apr 11, 2021

Hi,

Thanks for the great repo, I've really enjoyed exploring it!
However, when I ran the code from the "Use BLINK in your codebase" section of the README, I found the model relatively slow (in fast=False mode).
To be more specific, when I execute main_dense.run, the first stage of processing is relatively slow (~2.5 seconds per item), while the later stage (the one that prints "Evaluation") handles ~5 items per second. I also tried adding an index as below.

config = {
    ...
    "faiss_index": "flat",
    "index_path": models_path+"faiss_flat_index.pkl"
}

However, the first stage became even slower (~20 seconds per item). I'm wondering whether I'm misconfiguring something (especially the FAISS index) that causes the low speed. Are there any corrections or methods to speed things up? Thanks for your help!
(I'll post the performance logs below if needed!)

ledw (Contributor) commented Apr 12, 2021

Hi @AOZMH,
Thanks for reporting. The fast=False mode is known to be slow; however, the FAISS index should be faster. Could you provide a code snippet? Thanks.

AOZMH (Author) commented Apr 13, 2021

Thanks for the follow-up. I'll provide my code snippet and the execution logs (time costs) shortly!

@wutaiqiang

Using the bi-encoder + cross-encoder is much slower than the bi-encoder only.

@wutaiqiang

Since transformer self-attention is O(n^2) in sequence length, and the cross-encoder encodes the concatenated mention + candidate sequence (length 2n) while the bi-encoder encodes the two length-n sequences separately, we can infer that T(crossencoder) : T(biencoder) = (n+n)^2 : 2*n^2 = 2 : 1.

@wutaiqiang

Roughly, then, we can expect T(cross+bi) : T(bi) = 3 : 1, ignoring the other processing.
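
Spelling out the arithmetic (a rough sketch that assumes the mention context and a candidate description are both length-n sequences, and ignores that the cross-encoder is run once per top-k candidate):

# Rough arithmetic behind the 2:1 and 3:1 claims above, assuming O(n^2)
# self-attention dominates the cost.
n = 128                          # illustrative sequence length; it cancels out
t_bi = n**2 + n**2               # bi-encoder: two separate length-n encodings
t_cross = (n + n)**2             # cross-encoder: one length-2n encoding
print(t_cross / t_bi)            # 2.0 -> T(cross) : T(bi) = 2 : 1
print((t_bi + t_cross) / t_bi)   # 3.0 -> T(cross+bi) : T(bi) = 3 : 1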

AOZMH closed this as completed May 6, 2021
AOZMH reopened this May 6, 2021
AOZMH (Author) commented May 6, 2021

Roughly, then, we can expect T(cross+bi) : T(bi) = 3 : 1, ignoring the other processing.

  • Thanks for the theoretical analysis! That's a bit different from the case I encountered, in which the bi-encoder phase was quite slow and adding an index turned out to be even slower.
  • Actually, when I read through the code, I found that the bi-encoder phase runs entirely on CPU (including the transformer encoding of the sentence and the matrix multiplication that computes a similarity score for every entity), resulting in slow execution compared with the cross-encoder, which runs entirely on GPU.
  • I also manually changed the code to run the bi-encoder phase on GPU, which caused a GPU out-of-memory error due to the LARGE multiplication (hidden_dim * num_entities). As far as I can tell, the execution requires at least 24GB of GPU memory at fp32, which exceeds the limit of my 16GB P100.
  • To work around that, I tried switching all calculations to fp16, which did not cause OOM but produced a warning that large fp16 matmuls have bugs (see this). Indeed, the fp16 bi-encoder results were totally wrong, e.g. giving a few incorrect entities high scores.
  • Finally, I manually pruned the entity set to fit in 16GB of memory at fp32 and everything worked fine; the bi-encoder and cross-encoder then took ~20ms and ~100ms respectively, which is reasonable for me.

To wrap up, I conclude that:

  1. I guess it was the high memory cost that led the contributors to run the bi-encoder phase on CPU (which uses RAM instead of GPU memory), but that significantly hurts performance; as we all know, transformers are far slower on CPU than on GPU.
  2. To run BLINK smoothly on a single GPU, I think we need more than 24GB of GPU memory (see the rough estimate below), which I guess is not scarce at FAIR, but it poses some difficulty for students like me :)
  3. What I still haven't solved is the slow execution with the FAISS index, which was even slower than the pure-CPU bi-encoder execution. Maybe someone can give some additional comments?
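
As a rough estimate for point 2 (the entity count and hidden size are my assumptions about the released Wikipedia models, so treat the numbers as indicative only):

# Back-of-the-envelope GPU memory estimate for scoring against the full
# candidate set at fp32. NUM_ENTITIES and HIDDEN_DIM are assumptions about
# the released Wikipedia models, not values read from this repo.
NUM_ENTITIES = 5_900_000   # assumed approximate size of entity.jsonl
HIDDEN_DIM = 1024          # assumed BERT-large hidden size of the bi-encoder
BYTES_FP32 = 4

encoding_gb = NUM_ENTITIES * HIDDEN_DIM * BYTES_FP32 / 1e9
print(f"candidate encodings alone: ~{encoding_gb:.1f} GB")  # ~24 GB
# The BERT-large weights, activations and the score matrix come on top of
# this, which is why a 16GB P100 runs out of memory.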

Thanks for all the help! I'll be happy to follow any updates.

AOZMH (Author) commented May 6, 2021

Hi @AOZMH,
Thanks for reporting. The fast=False mode is known to be slow; however, the FAISS index should be faster. Could you provide a code snippet? Thanks.

My code snippet was the same as the one in the README, as shown below.

import blink.main_dense as main_dense
import argparse

models_path = "models/" # the path where you stored the BLINK models

config = {
    "test_entities": None,
    "test_mentions": None,
    "interactive": False,
    "top_k": 10,
    "biencoder_model": models_path+"biencoder_wiki_large.bin",
    "biencoder_config": models_path+"biencoder_wiki_large.json",
    "entity_catalogue": models_path+"entity.jsonl",
    "entity_encoding": models_path+"all_entities_large.t7",
    "crossencoder_model": models_path+"crossencoder_wiki_large.bin",
    "crossencoder_config": models_path+"crossencoder_wiki_large.json",
    "fast": False, # set this to be true if speed is a concern
    "output_path": "logs/" # logging directory
}

args = argparse.Namespace(**config)

models = main_dense.load_models(args, logger=None)

data_to_link = [ {
                    "id": 0,
                    "label": "unknown",
                    "label_id": -1,
                    "context_left": "".lower(),
                    "mention": "Shakespeare".lower(),
                    "context_right": "'s account of the Roman general Julius Caesar's murder by his friend Brutus is a meditation on duty.".lower(),
                },
                {
                    "id": 1,
                    "label": "unknown",
                    "label_id": -1,
                    "context_left": "Shakespeare's account of the Roman general".lower(),
                    "mention": "Julius Caesar".lower(),
                    "context_right": "'s murder by his friend Brutus is a meditation on duty.".lower(),
                }
                ]

_, _, _, _, _, predictions, scores, = main_dense.run(args, None, *models, test_data=data_to_link)

Also, I changed the config as below to add the FAISS index.

config = {
    "test_entities": None,
    "test_mentions": None,
    "interactive": False,
    "top_k": 10,
    "biencoder_model": models_path+"biencoder_wiki_large.bin",
    "biencoder_config": models_path+"biencoder_wiki_large.json",
    "entity_catalogue": models_path+"entity.jsonl",
    "entity_encoding": models_path+"all_entities_large.t7",
    "crossencoder_model": models_path+"crossencoder_wiki_large.bin",
    "crossencoder_config": models_path+"crossencoder_wiki_large.json",
    "fast": False, # set this to be true if speed is a concern
    "output_path": "logs/", # logging directory
    "faiss_index": "flat",
    "index_path": models_path+"faiss_flat_index.pkl"
}

@shahjui2000

Hey, how did you manually change the bi-encoder to GPU? Could you share the snippets?

AOZMH (Author) commented Jun 2, 2021

Hey, how did you manually change the bi-encoder to GPU? Could you share the snippets?

You can simply revert this commented-out code to restore the transfer to GPU (# .to(device) => .to(device)) and manually move the corresponding model input tensors to GPU, and that should work.

It may take me a few days to clean up my (experimental) code, so maybe you can give it a try with the ideas above; as far as I can recall, it requires fewer than 20 lines of code changes. Anyway, if you still have any problems, please feel free to reply and I'll try to share my code snippet.
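
In the meantime, here is a minimal sketch of the kind of change I mean; the identifiers (context_encoder, context_input, candidate_encoding) are illustrative rather than the exact names in blink/main_dense.py, and it assumes the full candidate-encoding matrix fits in GPU memory.

# Illustrative sketch only -- the identifiers are hypothetical, not the exact
# ones in the BLINK code. Assumes the full candidate-encoding matrix fits in
# GPU memory (roughly 24GB at fp32 for the Wikipedia models).
import torch

device = torch.device("cuda")

context_encoder.to(device)                              # bi-encoder context tower
candidate_encoding = candidate_encoding.to(device)      # (num_entities, hidden_dim)

with torch.no_grad():
    context_vecs = context_encoder(context_input.to(device))  # (batch, hidden_dim)
    scores = context_vecs @ candidate_encoding.T               # (batch, num_entities)
    top_scores, top_ids = scores.topk(10, dim=1)               # keep the top-10 entities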

shahjui2000 commented Jun 2, 2021

I am facing the same memory issue you did. Could you elaborate on your third point, about how to lower memory usage?

AOZMH (Author) commented Jun 3, 2021

I am facing the same memory issue you did. Could you elaborate on your third point, about how to lower memory usage?

I tried two ways, as follows:

  1. Prune the candidate entity set. This code refers to the vector representations of all entities in Wikipedia (of shape <num_candidates, hidden_dim>). That matrix is loaded by the load_models function used in the README, so you can arbitrarily prune it to a smaller set of entities, which eliminates the OOM. However, as you can see, this is a purely experimental approach just to test the running speed, and it can lose entity-linking recall (since some entities are excluded from the candidate set).
  2. Execute in FP16. You can convert all the corresponding tensors (e.g. candidate encodings, model weights, input tensors, etc.) to half-precision floats (16-bit) with a = a.half(). However, as in this discussion, matrix multiplication on very large tensors may produce garbled results, which is exactly what happened for me: the entity scores given by the bi-encoder were totally wrong. Maybe you can try a later CUDA version to solve that issue, given my relatively outdated CUDA 10.0; fingers crossed for your new findings! (A rough sketch of both workarounds follows below.)
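
A rough sketch of both workarounds; candidate_encoding is assumed to be the (num_entities, hidden_dim) tensor returned by load_models, and the numbers are purely illustrative.

# Rough sketch of both workarounds; candidate_encoding is assumed to be the
# (num_entities, hidden_dim) torch tensor returned by load_models, and the
# cut-off below is purely illustrative.
import torch

# 1) Prune the candidate set so the matrix fits on a 16GB GPU. Any entity
#    outside the kept slice can never be linked, so recall drops.
keep = 3_000_000
pruned_encoding = candidate_encoding[:keep].to("cuda")

# 2) Or cast to fp16 to halve the memory. Large fp16 matmuls gave me garbled
#    scores on CUDA 10.0, so check the bi-encoder outputs before trusting them.
half_encoding = candidate_encoding.half().to("cuda")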

AOZMH (Author) commented Jun 3, 2021

BTW, I really hope the developers of BLINK can look into this issue and solve the FAISS index problem I mentioned before. Thanks in advance!

@shahjui2000

I wonder if this is also a feasible solution: splitting candidate_encoding when passing it to the GPU, then concatenating the split scores and continuing with the code. That way, the memory moved to the GPU at each call is reduced without removing entities.

AOZMH (Author) commented Jun 3, 2021

I wonder if this is also a feasible solution: splitting candidate_encoding when passing it to the GPU, then concatenating the split scores and continuing with the code. That way, the memory moved to the GPU at each call is reduced without removing entities.

That should work properly, but the time cost would be considerable. The basic assumption is that the whole candidate_encoding cannot fit into GPU memory, so if you split it into A and B, you still cannot put both on the GPU at the same time. Thus, for each execution (rather than once at model initialization), we need to transfer A to the GPU, run on A, delete A from GPU memory, then transfer B, run on B, and delete B. In other words, such a splitting approach requires transfers between main memory and GPU memory on every EXECUTION, which would be costly.

However, it would be a good idea if you have multiple GPUs, e.g. putting A and B PERMANENTLY on GPUs 0 and 1, so that no per-execution transfer is needed.
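
For concreteness, here is a sketch of the per-call chunking being discussed (illustrative names); the host-to-device copy of every chunk on every call is exactly the transfer cost mentioned above.

# Sketch of the chunked scoring idea (illustrative names). Each chunk is
# copied from host RAM to the GPU on every call, which is the per-execution
# transfer cost discussed above.
import torch

def chunked_scores(context_vecs, candidate_encoding, chunk_size=1_000_000):
    # context_vecs: (batch, hidden_dim), already on the GPU
    # candidate_encoding: (num_entities, hidden_dim), kept in host RAM
    parts = []
    for start in range(0, candidate_encoding.shape[0], chunk_size):
        chunk = candidate_encoding[start:start + chunk_size].to(context_vecs.device)
        parts.append(context_vecs @ chunk.T)  # (batch, chunk)
        del chunk                             # free GPU memory before the next chunk
    return torch.cat(parts, dim=1)            # (batch, num_entities)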

Jun-jie-Huang commented Jun 18, 2021

I think a possible solution is to encode all the queries and all the candidates on GPU and save them, then build a FAISS index on CPU to find the nearest entities. FAISS is much more efficient and gives satisfactory results, but this takes considerable effort and means reconstructing the pipeline.
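
A minimal sketch of that pipeline with an exact (flat) inner-product index; the array names are illustrative, and FAISS expects contiguous float32 numpy arrays.

# Minimal sketch: precompute the encodings on GPU, then search on CPU with
# FAISS. Array names are illustrative.
import faiss
import numpy as np

candidate_vecs = np.ascontiguousarray(candidate_encoding.cpu().numpy(), dtype=np.float32)
query_vecs = np.ascontiguousarray(context_vecs.cpu().numpy(), dtype=np.float32)

index = faiss.IndexFlatIP(candidate_vecs.shape[1])  # exact inner-product search
index.add(candidate_vecs)
scores, entity_ids = index.search(query_vecs, 10)   # top-10 nearest entities per query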

BTW, if you use a 32GB V100, the problem will not occur.

shixiao9941 commented Jun 25, 2021

Hey, how did you manually change the bi-encoder to GPU? Could you share the snippets?

You can simply revert this commented-out code to restore the transfer to GPU (# .to(device) => .to(device)) and manually move the corresponding model input tensors to GPU, and that should work.

It may take me a few days to clean up my (experimental) code, so maybe you can give it a try with the ideas above; as far as I can recall, it requires fewer than 20 lines of code changes. Anyway, if you still have any problems, please feel free to reply and I'll try to share my code snippet.

Hi @AOZMH, could you please share your snippet for running the bi-encoder on GPU? I did that, but the speed is still slow; maybe I did it wrong...

@shahjui2000

Hey, same for me! Changing it to GPU was actually slower than running on CPU.

tomtung commented Jul 23, 2021

BTW, patches #89 and #90 might help. To enable the GPU, you can try editing your biencoder_wiki_large.json and crossencoder_wiki_large.json files to set no_cuda to false.
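
A minimal sketch of that edit (it assumes the config files live under models/ as in the README; back them up first):

# Minimal sketch: flip "no_cuda" to false in the released model configs.
import json

for name in ("biencoder_wiki_large.json", "crossencoder_wiki_large.json"):
    path = "models/" + name
    with open(path) as f:
        cfg = json.load(f)
    cfg["no_cuda"] = False  # let the model be placed on the GPU
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)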

@rajharshiitb

Hi @AOZMH, can you please share how you converted BLINK to work in FP16? I am getting errors.
Thanks!

@amelieyu1989

I met the same problem: when I add the FAISS index path, it becomes slower.

@abhinavkulkarni

Hi,

You may want to make some changes to the codebase to add support for more FAISS index types. Currently, the BLINK codebase only supports flat indices.

I am currently using an OPQ32_768,IVF4096,PQ32x8 index built on the candidate encodings, and the speed improvement is significant.

For example, this is what my faiss_indexer.py looks like.
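
The linked file isn't reproduced here, but building and saving such an index with the FAISS index factory looks roughly like this; candidate_vecs is assumed to be the float32 (num_entities, hidden_dim) matrix of candidate encodings, and models_path is the same directory as in the config below.

# Rough sketch of building the compressed index with the FAISS index factory;
# candidate_vecs is assumed to be the float32 (num_entities, hidden_dim)
# numpy matrix of candidate encodings.
import faiss

d = candidate_vecs.shape[1]
index = faiss.index_factory(d, "OPQ32_768,IVF4096,PQ32x8", faiss.METRIC_INNER_PRODUCT)
index.train(candidate_vecs)                 # learn OPQ rotation, IVF centroids, PQ codebooks
index.add(candidate_vecs)
faiss.extract_index_ivf(index).nprobe = 64  # probe more inverted lists for better recall
faiss.write_index(index, models_path + "index_opq32_768_ivf4096_pq32x8.faiss")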

This is how I load the models.

config = {
    "interactive": False,
    "fast": False,
    "top_k": 8,
    "biencoder_model": models_path + "biencoder_wiki_large.bin",
    "biencoder_config": models_path + "biencoder_wiki_large.json",
    "crossencoder_model": models_path + "crossencoder_wiki_large.bin",
    "crossencoder_config": models_path + "crossencoder_wiki_large.json",
    "entity_catalogue": models_path + "entities_aliases_with_ids.jsonl",
    "entity_encoding": models_path + "all_entities_aliases.t7",
    "faiss_index": "OPQ32_768,IVF4096,PQ32x8",
    "index_path": models_path + "index_opq32_768_ivf4096_pq32x8.faiss",
    "output_path": "logs/",  # logging directory
}

self.args = argparse.Namespace(**config)

logger.info("Loading BLINK model...")
self.models = main_dense.load_models(self.args, logger=logger)
