Batch cos_sim for community_detection? #1023
Comments
Hi @mmaybeno
This is great. I'll try it out on my own data to see how it looks. Thank you for the quick reply!
Excuse me, did this method work?
I'm still reviewing the results. One thing I did notice, though: even if the scores are computed in batches, clustering still runs out of memory quickly for larger datasets. So even if this implementation gives similar results, it can still break on larger data, just in a different section of the code :).
It didn't work for me, but I did find this Kaggle post doing a batch process in TensorFlow. I tried my best to convert it below, swapping out the cos_sim call inside community_detection:

import math
import torch
from tqdm import tqdm

def batch_generator(embeddings, batch_size=1000):
    # Yield consecutive row slices of the embedding tensor
    start = 0
    stop = 0
    while stop < embeddings.shape[0]:
        stop = stop + batch_size
        yield embeddings[start:stop]
        start = stop
def batch_cos_sim(emb_a, emb_b, batch_size=500):
    cos_scores_batch = torch.zeros((emb_a.shape[0], emb_b.shape[0]), device=emb_a.device)
    n_batches = math.ceil(len(emb_a) / batch_size)
    for i, i_emb in enumerate(tqdm(batch_generator(emb_a, batch_size=batch_size), total=n_batches)):
        for j, j_emb in enumerate(batch_generator(emb_b, batch_size=batch_size)):
            similarity = torch.sum(i_emb.unsqueeze(1) * j_emb, axis=-1)
            similarity /= torch.linalg.norm(i_emb.unsqueeze(1), axis=-1) * torch.linalg.norm(j_emb, axis=-1)
            # Write the block at its absolute offset; the end index must allow for
            # a final batch that is smaller than batch_size
            cos_scores_batch[
                i * batch_size:i * batch_size + i_emb.shape[0],
                j * batch_size:j * batch_size + j_emb.shape[0]
            ] = similarity
    return cos_scores_batch

def community_detection(embeddings, threshold=0.75, min_community_size=10, init_max_size=1000):
"""
Function for Fast Community Detection
Finds in the embeddings all communities, i.e. embeddings that are close (closer than threshold).
Returns only communities that are larger than min_community_size. The communities are returned
in decreasing order. The first element in each list is the central point in the community.
"""
# Compute cosine similarity scores
cos_scores = batch_cos_sim(embeddings, embeddings) # <--- only changed piece of code
# Minimum size for a community
top_k_values, _ = cos_scores.topk(k=min_community_size, largest=True)
# Filter for rows >= min_threshold
extracted_communities = []
for i in range(len(top_k_values)):
if top_k_values[i][-1] >= threshold:
new_cluster = []
# Only check top k most similar entries
top_val_large, top_idx_large = cos_scores[i].topk(k=init_max_size, largest=True)
top_idx_large = top_idx_large.tolist()
top_val_large = top_val_large.tolist()
if top_val_large[-1] < threshold:
for idx, val in zip(top_idx_large, top_val_large):
if val < threshold:
break
new_cluster.append(idx)
else:
# Iterate over all entries (slow)
for idx, val in enumerate(cos_scores[i].tolist()):
if val >= threshold:
new_cluster.append(idx)
extracted_communities.append(new_cluster)
# Largest cluster first
extracted_communities = sorted(extracted_communities, key=lambda x: len(x), reverse=True)
# Step 2) Remove overlapping communities
unique_communities = []
extracted_ids = set()
for community in extracted_communities:
add_cluster = True
for idx in community:
if idx in extracted_ids:
add_cluster = False
break
if add_cluster:
unique_communities.append(community)
for idx in community:
extracted_ids.add(idx)
return unique_communities There is a slight difference in results doing it in batches. Maybe I'm missing something?
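One way to pin down where the batched scores diverge is to compare them against the library's dense util.cos_sim on a small random sample. A minimal sketch, assuming the batch_cos_sim defined above and a row count that is deliberately not a multiple of the batch size:

import torch
from sentence_transformers import util

emb = torch.nn.functional.normalize(torch.randn(1234, 384), p=2, dim=1)

full = util.cos_sim(emb, emb)                       # dense reference computation
batched = batch_cos_sim(emb, emb, batch_size=500)   # batched version from above

print(torch.allclose(full, batched, atol=1e-5))     # should agree up to float error
print((full - batched).abs().max())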
Nevermind, once you create a large dataset you start to run into memory problems again 😂. The batching works, but it still allocates the full similarity matrix, so larger datasets hit the same wall.
With large datasets it may be best to use some kind of ANN (approximate nearest neighbor) search to find the top-k neighbors. Going to look into it more over the weekend and see what I can come up with. The problem with that would be the added dependency for this type of operation.
Maybe you can do dimensionality reduction first and then do the matrix multiplication, which might be much better for memory consumption.
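A rough sketch of that idea, assuming PCA via torch.pca_lowrank and sentence-transformers' util.cos_sim (the sizes here are made up). Note this mainly shrinks the per-pair dot products and intermediates; the N x N score matrix itself stays the same size:

import torch
from sentence_transformers import util

embeddings = torch.randn(10_000, 768)               # stand-in for real sentence embeddings

# Project onto the top 128 principal components (pca_lowrank centers by default)
U, S, V = torch.pca_lowrank(embeddings, q=128)
reduced = (embeddings - embeddings.mean(dim=0)) @ V[:, :128]

scores = util.cos_sim(reduced, reduced)             # cheaper per-pair dot products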
I ended up using an ANN to approximate the distance search. It works, just not as well. I'm specifically using Annoy since it stores the index on disk and is generally memory-efficient. There is probably something I am missing to make the clusters a little better, but for now this is what I have:

import numpy as np
from annoy import AnnoyIndex
from tqdm.notebook import trange, tqdm
def community_detection(
    embeddings, threshold=0.75, min_community_size=10, init_max_size=1000
):
    """
    Function for Fast Community Detection
    Finds in the embeddings all communities, i.e. embeddings that are close (closer than threshold).
    Returns only communities that are larger than min_community_size. The communities are returned
    in decreasing order. The first element in each list is the central point in the community.
    """
    # Compute cosine similarity scores with ANN index
    index = AnnoyIndex(embeddings.shape[1], "angular")
    for i, v in tqdm(
        enumerate(embeddings.cpu().numpy()),
        total=embeddings.shape[0],
        desc="Building ANN",
    ):
        index.add_item(i, v)
    index.build(30)  # configure to change results
    index.save("test.ann")

    # Minimum size for a community
    top_k_values = []
    for emb_i in tqdm(
        range(embeddings.shape[0]), desc="Finding Minimum Community Size"
    ):
        _, distance = index.get_nns_by_item(
            emb_i, min_community_size, include_distances=True
        )
        top_k_values.append(distance)
    top_k_values = 1 - np.array(top_k_values)

    # Filter for rows >= min_threshold
    extracted_communities = []
    total_entries = embeddings.shape[0]
    for i in tqdm(range(total_entries), desc="Extracting Communities"):
        if top_k_values[i][-1] >= threshold:
            new_cluster = []
            top_idx_large, top_val_large = index.get_nns_by_item(
                i, init_max_size, include_distances=True
            )
            top_val_large = (1 - np.array(top_val_large)).tolist()

            if top_val_large[-1] < threshold:
                for idx, val in zip(top_idx_large, top_val_large):
                    if val < threshold:
                        break
                    new_cluster.append(idx)
            else:
                # Iterate over most entries (slow)
                # Only searches 50% of possible entries or 10000,
                # otherwise this can be a very expensive search
                min_most_size = min([int(total_entries * 0.5), 10000])
                idx_large, val_large = index.get_nns_by_item(
                    i, min_most_size, include_distances=True
                )
                val_large = (1 - np.array(val_large)).tolist()
                for idx, val in zip(idx_large, val_large):
                    if val >= threshold:
                        new_cluster.append(idx)

            extracted_communities.append(new_cluster)

    # Largest cluster first
    extracted_communities = sorted(
        extracted_communities, key=lambda x: len(x), reverse=True
    )

    # Step 2) Remove overlapping communities
    unique_communities = []
    extracted_ids = set()
    for community in extracted_communities:
        add_cluster = True
        for idx in community:
            if idx in extracted_ids:
                add_cluster = False
                break

        if add_cluster:
            unique_communities.append(community)
            for idx in community:
                extracted_ids.add(idx)

    return unique_communities
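One likely reason the ANN clusters come out slightly worse: Annoy's "angular" metric is the Euclidean distance between L2-normalized vectors, i.e. sqrt(2 * (1 - cos)), so 1 - distance above is only a rough proxy for cosine similarity. A small sketch of the exact conversion (the helper name is mine, not from the thread):

import numpy as np

def angular_to_cosine(distances):
    # Annoy "angular" distance: d = sqrt(2 * (1 - cos)), hence cos = 1 - d**2 / 2
    return 1.0 - np.asarray(distances, dtype=np.float64) ** 2 / 2.0

# e.g. use angular_to_cosine(top_val_large) instead of 1 - np.array(top_val_large)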
Hello there! Very interesting thread. Sorry if this is a noob question, but I have a lot of RAM, much more than my GPU has memory. Is there a way to move the embeddings to the CPU so that I can compute the clusters without hitting OOM errors? Thanks!
Yes. When the model returns PyTorch tensors, you can move the embeddings to the CPU (for example with embeddings = embeddings.cpu()). Computation will then be done on the CPU.
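A minimal sketch of that, assuming a sentence-transformers model and the built-in util.community_detection (the model name and sentences are placeholders):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["first sentence", "second sentence", "..."]  # your corpus here

# Encode on the GPU (if available), then move the embeddings into system RAM
embeddings = model.encode(sentences, convert_to_tensor=True, show_progress_bar=True)
embeddings = embeddings.cpu()

# Clustering now runs on the CPU tensor, limited by RAM instead of GPU memory
clusters = util.community_detection(embeddings, threshold=0.75, min_community_size=10)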
Hi @nreimers
I have tested this & created a PR #1603 for the same. |
Fixed with the most recent commit, will be part of version 2.2.1 |
I've been experimenting with the community_detection method but noticed I quickly get OOM errors if the set of embeddings is too large. Seeing how it uses cos_sim to compute all the embedding distances, do you think it would make sense to have an option for batching? I believe you will find other bottlenecks when iterating over the entries, but at least it would complete on larger embeddings.
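For a rough sense of why the dense approach hits OOM (illustrative numbers, not from the issue), the full score matrix grows quadratically with the number of embeddings:

# Memory needed for the dense N x N cosine-similarity matrix in float32
n = 100_000                      # number of embeddings
bytes_needed = n * n * 4         # one 4-byte score per pair
print(f"{bytes_needed / 1e9:.0f} GB")  # -> 40 GB, before any clustering work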