# Large-scale agglomerative clustering on millions of sentences

This notebook contains the code for the article "Clustering millions of sentences to optimise the ML-workflow". It implements the scalable sentence clustering algorithm and applies it, as an example, to about 1 million Bing queries from the MS MARCO dataset.


# Setup

In [1]:
!curl -O https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz
!tar -xf queries.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18.0M  100 18.0M    0     0  18.1M      0 --:--:-- --:--:-- --:--:-- 18.1M


In [2]:
%%capture
!pip install sentence_transformers funcy pickle5

In [3]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import math

# Embedding code

In [7]:
def embed_data(data, key='text', model_name='all-MiniLM-L6-v2', cores=1, gpu=False, batch_size=128):
    """
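    Embed the sentences in data[key] with a SentenceTransformer model
    (default: all-MiniLM-L6-v2, a MiniLM model that uses mean pooling).

    Each distinct sentence is encoded only once; the returned array has one
    row per row of data, in the original order.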
    """
    print('Embedding data')
    model = SentenceTransformer(model_name)
    print('Model loaded')

    sentences = data[key].tolist()
    unique_sentences = data[key].unique()
    print('Unique sentences', len(unique_sentences))

    if cores == 1:
        embeddings = model.encode(unique_sentences, show_progress_bar=True, batch_size=batch_size)
    else:
        devices = ['cpu'] * cores
        if gpu:
            devices = None  # use all CUDA devices

        # Start the multi-process pool on multiple devices
        print('Multi-process pool starting')
        pool = model.start_multi_process_pool(devices)
        print('Multi-process pool started')

        chunk_size = math.ceil(len(unique_sentences) / cores)

        # Compute the embeddings using the multi-process pool
        embeddings = model.encode_multi_process(unique_sentences, pool, batch_size=batch_size, chunk_size=chunk_size)
        model.stop_multi_process_pool(pool)

    print("Embeddings computed")

    mapping = {sentence: embedding for sentence, embedding in zip(unique_sentences, embeddings)}
    embeddings = np.array([mapping[sentence] for sentence in sentences])
  
    return embeddings
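
The function above encodes each distinct sentence only once and then expands the result back to the original row order through a dictionary lookup. A minimal sketch of that pattern follows. The names `toy_embed` and `embed_with_dedup` are hypothetical and exist only for illustration; `toy_embed` stands in for `model.encode` and is not part of `sentence_transformers`.

```python
def toy_embed(sentences):
    """Hypothetical stand-in for model.encode: one small vector per sentence."""
    return [[float(len(s)), float(s.count(" "))] for s in sentences]


def embed_with_dedup(sentences, embed_fn):
    # Drop duplicates while preserving first-seen order
    unique_sentences = list(dict.fromkeys(sentences))
    # Encode each distinct sentence exactly once
    vectors = embed_fn(unique_sentences)
    mapping = dict(zip(unique_sentences, vectors))
    # Expand back so that row i of the result corresponds to sentences[i]
    return [mapping[s] for s in sentences], len(unique_sentences)


rows, n_unique = embed_with_dedup(["a b", "c", "a b"], toy_embed)
```

When a dataset contains many repeated sentences, this keeps the number of model calls proportional to the number of distinct sentences rather than the number of rows.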

# Clustering code


In [8]:
import logging
import math
import os
import pickle  # the standard library module is sufficient here
from collections import defaultdict

import numpy as np
import torch
from funcy import log_durations
from joblib import Parallel, delayed
from torch import Tensor
from tqdm import tqdm


def cos_sim(a: Tensor, b: Tensor):
    """
    Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j.
    :return: Matrix with res[i][j]  = cos_sim(a[i], b[j])
    """
    if not isinstance(a, torch.Tensor):
        a = torch.tensor(np.array(a))

    if not isinstance(b, torch.Tensor):
        b = torch.tensor(np.array(b))

    if len(a.shape) == 1:
        a = a.unsqueeze(0)

    if len(b.shape) == 1:
        b = b.unsqueeze(0)

    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))


def get_embeddings(ids, embeddings):
    return np.array([embeddings[idx] for idx in ids])


def reorder_and_filter_cluster(
    cluster_idx, cluster, cluster_embeddings, cluster_head_embedding, threshold
):
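    # Similarity of every member to the cluster head, from most to least similar
    cos_scores = cos_sim(cluster_head_embedding, cluster_embeddings)
    sorted_vals, indices = torch.sort(cos_scores[0], descending=True)
    # Keep only the members whose similarity to the head exceeds the threshold
    bigger_than_threshold = sorted_vals > threshold
    indices = indices[bigger_than_threshold].tolist()
    sorted_vals = sorted_vals[bigger_than_threshold].tolist()
    # sorted_vals[k] is the similarity of the member at original position indices[k]
    return cluster_idx, [(cluster[i][0], val) for i, val in zip(indices, sorted_vals)]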


def get_ids(cluster):
    return [transaction[0] for transaction in cluster]


def reorder_and_filter_clusters(clusters, embeddings, threshold, parallel):
    results = parallel(
        delayed(reorder_and_filter_cluster)(
            cluster_idx,
            cluster,
            get_embeddings(get_ids(cluster), embeddings),
            get_embeddings([cluster_idx], embeddings),
            threshold,
        )
        for cluster_idx, cluster in tqdm(clusters.items())
    )

    clusters = {k: v for k, v in results}

    return clusters


def get_clustured_ids(clusters):
    clustered_ids = set(
        [transaction[0] for cluster in clusters.values() for transaction in cluster]
    )
    clustered_ids |= set(clusters.keys())
    return clustered_ids


def get_clusters_ids(clusters):
    return list(clusters.keys())


def get_unclustured_ids(ids, clusters):
    clustered_ids = get_clustured_ids(clusters)
    unclustered_ids = list(set(ids) - clustered_ids)
    return unclustered_ids


def sort_clusters(clusters):
    return dict(
        sorted(clusters.items(), key=lambda x: len(x[1]), reverse=True)
    )  # sort based on size


def sort_cluster(cluster):
    return list(
        sorted(cluster, key=lambda x: x[1], reverse=True)
    )  # sort based on similarity


def filter_clusters(clusters, min_cluster_size):
    return {k: v for k, v in clusters.items() if len(v) >= min_cluster_size}


def unique(collection):
    return list(dict.fromkeys(collection))


def unique_txs(collection):
    seen = set()
    return [x for x in collection if not (x[0] in seen or seen.add(x[0]))]


def write_pickle(data, path):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


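def chunk(txs, chunk_size):
    """Split txs into ceil(len(txs) / chunk_size) contiguous chunks of nearly equal size."""
    if len(txs) == 0:
        return iter(())  # nothing to split; avoids a division by zero below
    n = math.ceil(len(txs) / chunk_size)
    k, m = divmod(len(txs), n)
    return (txs[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(n))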



def online_community_detection(
    ids,
    embeddings,
    clusters=None,
    threshold=0.7,
    min_cluster_size=3,
    chunk_size=2500,
    iterations=10,
    cores=1,
):
    if clusters is None:
        clusters = {}

    with Parallel(n_jobs=cores) as parallel:
        for iteration in range(iterations):
            print("1. Nearest cluster")
            unclustered_ids = get_unclustured_ids(ids, clusters)
            cluster_ids = list(clusters.keys())
            print("Unclustured", len(unclustered_ids))
            print("Clusters", len(cluster_ids))
            clusters = nearest_cluster(
                unclustered_ids,
                embeddings,
                clusters,
                chunk_size=chunk_size,
                parallel=parallel,
            )
            print("\n\n")

            print("2. Create new clusters")
            unclustered_ids = get_unclustured_ids(ids, clusters)
            print("Unclustured", len(unclustered_ids))
            new_clusters = create_clusters(
                unclustered_ids,
                embeddings,
                clusters={},
                min_cluster_size=min_cluster_size,
                chunk_size=chunk_size,
                threshold=threshold,
                parallel=parallel,
            )
            new_cluster_ids = list(new_clusters.keys())
            print("\n\n")

            print("3. Merge new clusters", len(new_cluster_ids))
            max_clusters_size = 25000
            while True:
                new_cluster_ids = list(new_clusters.keys())
                old_new_cluster_ids = new_cluster_ids
                new_clusters = create_clusters(
                    new_cluster_ids,
                    embeddings,
                    new_clusters,
                    min_cluster_size=1,
                    chunk_size=max_clusters_size,
                    threshold=threshold,
                    parallel=parallel,
                )
                new_clusters = filter_clusters(new_clusters, 2)

                new_cluster_ids = list(new_clusters.keys())
                print("New merged clusters", len(new_cluster_ids))
                if len(old_new_cluster_ids) < max_clusters_size:
                    break

            new_clusters = filter_clusters(new_clusters, min_cluster_size)
            print(
                f"New clusters with min community size >= {min_cluster_size}",
                len(new_clusters),
            )
            clusters = {**new_clusters, **clusters}
            print("Total clusters", len(clusters))
            clusters = sort_clusters(clusters)
            print("\n\n")

            print("4. Nearest cluster")
            unclustered_ids = get_unclustured_ids(ids, clusters)
            cluster_ids = list(clusters.keys())
            print("Unclustured", len(unclustered_ids))
            print("Clusters", len(cluster_ids))
            clusters = nearest_cluster(
                unclustered_ids,
                embeddings,
                clusters,
                chunk_size=chunk_size,
                parallel=parallel,
            )
            clusters = sort_clusters(clusters)

            unclustered_ids = get_unclustured_ids(ids, clusters)
            clustured_ids = get_clustured_ids(clusters)
            print("Clustured", len(clustured_ids))
            print("Unclustured", len(unclustered_ids))
            print(
                f"Percentage clustured {len(clustured_ids) / (len(clustured_ids) + len(unclustered_ids)) * 100:.2f}%"
            )

            print("\n\n")
    return clusters


def nearest_cluster_chunk(
    chunk_ids, chunk_embeddings, cluster_ids, cluster_embeddings, threshold
):
    cos_scores = cos_sim(chunk_embeddings, cluster_embeddings)
    top_val_large, top_idx_large = cos_scores.topk(k=1, largest=True)
    top_idx_large = top_idx_large[:, 0].tolist()
    top_val_large = top_val_large[:, 0].tolist()
    cluster_assignment = []
    for i, (score, idx) in enumerate(zip(top_val_large, top_idx_large)):
        cluster_id = cluster_ids[idx]
        if score < threshold:
            cluster_id = None
        cluster_assignment.append(((chunk_ids[i], score), cluster_id))
    return cluster_assignment


def nearest_cluster(
    transaction_ids,
    embeddings,
    clusters=None,
    parallel=None,
    threshold=0.75,
    chunk_size=2500,
):
    if clusters is None:
        clusters = {}
    cluster_ids = list(clusters.keys())
    if len(cluster_ids) == 0:
        return clusters
    cluster_embeddings = get_embeddings(cluster_ids, embeddings)

    c = list(chunk(transaction_ids, chunk_size))

    with log_durations(logging.info, "Parallel jobs nearest cluster"):
        out = parallel(
            delayed(nearest_cluster_chunk)(
                chunk_ids,
                get_embeddings(chunk_ids, embeddings),
                cluster_ids,
                cluster_embeddings,
                threshold,
            )
            for chunk_ids in tqdm(c)
        )
        cluster_assignment = [assignment for sublist in out for assignment in sublist]

    for (transaction_id, similarity), cluster_id in cluster_assignment:
        if cluster_id is None:
            continue
        clusters[cluster_id].append(
            (transaction_id, similarity)
        )  # TODO sort in right order

    clusters = {
        cluster_id: unique_txs(sort_cluster(cluster))
        for cluster_id, cluster in clusters.items()
    }  # Sort based on similarity

    return clusters


def create_clusters(
    ids,
    embeddings,
    clusters=None,
    parallel=None,
    min_cluster_size=3,
    threshold=0.75,
    chunk_size=2500,
):
    if clusters is None:
        clusters = {}

    to_cluster_ids = np.array(ids)
    np.random.shuffle(
        to_cluster_ids
    )  # TODO evaluate performance without, try sorted list

    c = list(chunk(to_cluster_ids, chunk_size))

    with log_durations(logging.info, "Parallel jobs create clusters"):
        out = parallel(
            delayed(fast_clustering)(
                chunk_ids,
                get_embeddings(chunk_ids, embeddings),
                threshold,
                min_cluster_size,
            )
            for chunk_ids in tqdm(c)
        )

    # Combine output
    new_clusters = {}
    for out_clusters in out:
        for idx, cluster in out_clusters.items():
            # new_clusters[idx] = unique([(idx, 1)] + new_clusters.get(idx, []) + cluster)
            new_clusters[idx] = unique_txs(cluster + new_clusters.get(idx, []))

    # Add ids from old cluster to new cluster
    for cluster_idx, cluster in new_clusters.items():
        community_extended = []
        for (idx, similarity) in cluster:
            community_extended += [(idx, similarity)] + clusters.get(idx, [])
        new_clusters[cluster_idx] = unique_txs(community_extended)

    new_clusters = reorder_and_filter_clusters(
        new_clusters, embeddings, threshold, parallel
    )  # filter to keep only the relevant
    new_clusters = sort_clusters(new_clusters)

    clustered_ids = set()
    for idx, cluster_ids in new_clusters.items():
        filtered = set(cluster_ids) - clustered_ids
        cluster_ids = [
            cluster_idx for cluster_idx in cluster_ids if cluster_idx in filtered
        ]
        new_clusters[idx] = cluster_ids
        clustered_ids |= set(cluster_ids)

    new_clusters = filter_clusters(new_clusters, min_cluster_size)
    new_clusters = sort_clusters(new_clusters)
    return new_clusters


def fast_clustering(ids, embeddings, threshold=0.70, min_cluster_size=10):
    """
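    Fast community detection within a single chunk.

    Finds all communities in the embeddings, i.e. groups of embeddings whose cosine
    similarity to the community's head item is at least `threshold`. Only the largest
    non-overlapping communities are kept, and communities with fewer than
    `min_cluster_size` members are dropped.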
    """

    # Compute cosine similarity scores
    cos_scores = cos_sim(embeddings, embeddings)

    # Step 1) Create clusters where similarity is bigger than threshold
    bigger_than_threshold = cos_scores >= threshold
    indices = bigger_than_threshold.nonzero()

    cos_scores = cos_scores.numpy()

    extracted_clusters = defaultdict(list)
    for row, col in indices.tolist():
        extracted_clusters[ids[row]].append((ids[col], cos_scores[row, col]))

    extracted_clusters = sort_clusters(extracted_clusters)  # FIXME

    # Step 2) Remove overlapping clusters
    unique_clusters = {}
    extracted_ids = set()

    for cluster_id, cluster in extracted_clusters.items():
        add_cluster = True
        for transaction in cluster:
            if transaction[0] in extracted_ids:
                add_cluster = False
                break

        if add_cluster:
            unique_clusters[cluster_id] = cluster
            for transaction in cluster:
                extracted_ids.add(transaction[0])

    new_clusters = {}
    for cluster_id, cluster in unique_clusters.items():
        community_extended = []
        for idx in cluster:
            community_extended.append(idx)
        new_clusters[cluster_id] = unique_txs(community_extended)

    new_clusters = filter_clusters(new_clusters, min_cluster_size)

    return new_clusters
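
`fast_clustering` can be read as two steps. First it collects, for every item, all items whose cosine similarity meets the threshold. Then it walks through these candidate communities from largest to smallest and keeps only those that share no member with one already kept. Below is a minimal pure-Python sketch of that idea on a hand-written similarity matrix. The matrix, the ids and the name `greedy_communities` are illustrative assumptions; the notebook itself computes the scores with `cos_sim` on tensors.

```python
def greedy_communities(ids, sim, threshold=0.7, min_cluster_size=2):
    n = len(ids)
    # Step 1: the candidate community of row r is every column at or above the threshold
    candidates = {
        ids[r]: [(ids[c], sim[r][c]) for c in range(n) if sim[r][c] >= threshold]
        for r in range(n)
    }
    # Largest candidates first (the sort is stable, so ties keep input order)
    ordered = sorted(candidates.items(), key=lambda kv: len(kv[1]), reverse=True)

    # Step 2: keep a candidate only if none of its members is already taken
    kept, taken = {}, set()
    for head, members in ordered:
        member_ids = {member_id for member_id, _ in members}
        if member_ids & taken:
            continue
        kept[head] = members
        taken |= member_ids

    # Finally drop communities that are too small
    return {h: m for h, m in kept.items() if len(m) >= min_cluster_size}


ids = ["q1", "q2", "q3", "q4"]
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.6, 0.2],
    [0.8, 0.6, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
communities = greedy_communities(ids, sim)
```

In this toy input `q1` is close to both `q2` and `q3`, so its candidate community is the largest and is kept first; the smaller candidates headed by `q2` and `q3` overlap with it and are discarded, and the singleton `q4` falls below the minimum size.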


# Run

In [9]:
train = pd.read_csv('./queries.train.tsv', sep='\t', names=['id', 'query'])
dev = pd.read_csv('./queries.dev.tsv', sep='\t', names=['id', 'query'])
# Named eval_queries rather than eval to avoid shadowing the built-in eval()
eval_queries = pd.read_csv('./queries.eval.tsv', sep='\t', names=['id', 'query'])
data = pd.concat([train, dev, eval_queries])

In [16]:
data.shape

(1010916, 2)

In [10]:
ids = data.id

In [11]:
embeddings = embed_data(data, 'query', cores=1)
embeddings = {idx: embedding for idx, embedding in zip(ids, embeddings)}

Embedding data


Model loaded
Unique sentences 1010916


Batches:   0%|          | 0/7898 [00:00<?, ?it/s]

Embeddings computed


In [12]:
clusters = {}

In [13]:
clusters = online_community_detection(ids, embeddings, clusters, chunk_size=5000)

1. Nearest cluster
Unclustured 1010916
Clusters 0



2. Create new clusters
Unclustured 1010916


100%|██████████| 203/203 [01:14<00:00,  2.72it/s]
100%|██████████| 2363/2363 [00:00<00:00, 3850.51it/s]





3. Merge new clusters 2363


100%|██████████| 1/1 [00:00<00:00,  9.59it/s]
100%|██████████| 1073/1073 [00:00<00:00, 3602.00it/s]


New merged clusters 1073
New clusters with min community size >= 3 1073
Total clusters 1073



4. Nearest cluster
Unclustured 1004924
Clusters 1073


100%|██████████| 201/201 [00:17<00:00, 11.80it/s]


Clustured 33875
Unclustured 977041
Percentage clustured 3.35%



1. Nearest cluster
Unclustured 977041
Clusters 1073


100%|██████████| 196/196 [00:16<00:00, 11.93it/s]





2. Create new clusters
Unclustured 977041


100%|██████████| 196/196 [01:13<00:00,  2.67it/s]
100%|██████████| 1147/1147 [00:00<00:00, 3909.44it/s]





3. Merge new clusters 1147


100%|██████████| 1/1 [00:00<00:00, 31.58it/s]
100%|██████████| 839/839 [00:00<00:00, 3624.74it/s]


New merged clusters 839
New clusters with min community size >= 3 839
Total clusters 1912



4. Nearest cluster
Unclustured 973807
Clusters 1912


100%|██████████| 195/195 [00:26<00:00,  7.29it/s]


Clustured 53005
Unclustured 957911
Percentage clustured 5.24%



1. Nearest cluster
Unclustured 957911
Clusters 1912


100%|██████████| 192/192 [00:26<00:00,  7.30it/s]





2. Create new clusters
Unclustured 957911


100%|██████████| 192/192 [01:12<00:00,  2.66it/s]
100%|██████████| 750/750 [00:00<00:00, 4084.63it/s]





3. Merge new clusters 750


100%|██████████| 1/1 [00:00<00:00, 54.17it/s]
100%|██████████| 678/678 [00:00<00:00, 3753.71it/s]


New merged clusters 678
New clusters with min community size >= 3 678
Total clusters 2590



4. Nearest cluster
Unclustured 955666
Clusters 2590


100%|██████████| 192/192 [00:33<00:00,  5.81it/s]


Clustured 64859
Unclustured 946057
Percentage clustured 6.42%



1. Nearest cluster
Unclustured 946057
Clusters 2590


100%|██████████| 190/190 [00:33<00:00,  5.71it/s]





2. Create new clusters
Unclustured 946057


100%|██████████| 190/190 [01:10<00:00,  2.69it/s]
100%|██████████| 585/585 [00:00<00:00, 3869.53it/s]





3. Merge new clusters 585


100%|██████████| 1/1 [00:00<00:00, 86.12it/s]
100%|██████████| 553/553 [00:00<00:00, 3823.10it/s]


New merged clusters 553
New clusters with min community size >= 3 553
Total clusters 3143



4. Nearest cluster
Unclustured 944306
Clusters 3143


100%|██████████| 189/189 [00:39<00:00,  4.82it/s]


Clustured 73759
Unclustured 937157
Percentage clustured 7.30%



1. Nearest cluster
Unclustured 937157
Clusters 3143


100%|██████████| 188/188 [00:38<00:00,  4.84it/s]





2. Create new clusters
Unclustured 937157


100%|██████████| 188/188 [01:10<00:00,  2.67it/s]
100%|██████████| 517/517 [00:00<00:00, 3756.99it/s]





3. Merge new clusters 517


100%|██████████| 1/1 [00:00<00:00, 92.77it/s]
100%|██████████| 493/493 [00:00<00:00, 3696.07it/s]


New merged clusters 493
New clusters with min community size >= 3 493
Total clusters 3636



4. Nearest cluster
Unclustured 935603
Clusters 3636


100%|██████████| 188/188 [00:43<00:00,  4.31it/s]


Clustured 81140
Unclustured 929776
Percentage clustured 8.03%



1. Nearest cluster
Unclustured 929776
Clusters 3636


100%|██████████| 186/186 [00:43<00:00,  4.28it/s]





2. Create new clusters
Unclustured 929776


100%|██████████| 186/186 [01:09<00:00,  2.68it/s]
100%|██████████| 555/555 [00:00<00:00, 3625.63it/s]





3. Merge new clusters 555


100%|██████████| 1/1 [00:00<00:00, 63.55it/s]
100%|██████████| 534/534 [00:00<00:00, 3383.81it/s]


New merged clusters 534
New clusters with min community size >= 3 534
Total clusters 4170



4. Nearest cluster
Unclustured 928106
Clusters 4170


100%|██████████| 186/186 [00:48<00:00,  3.85it/s]


Clustured 88583
Unclustured 922333
Percentage clustured 8.76%



1. Nearest cluster
Unclustured 922333
Clusters 4170


100%|██████████| 185/185 [00:47<00:00,  3.89it/s]





2. Create new clusters
Unclustured 922333


100%|██████████| 185/185 [01:07<00:00,  2.74it/s]
100%|██████████| 479/479 [00:00<00:00, 3876.38it/s]





3. Merge new clusters 479


100%|██████████| 1/1 [00:00<00:00, 94.54it/s]
100%|██████████| 466/466 [00:00<00:00, 4123.47it/s]


New merged clusters 466
New clusters with min community size >= 3 466
Total clusters 4636



4. Nearest cluster
Unclustured 920891
Clusters 4636


100%|██████████| 185/185 [00:51<00:00,  3.56it/s]


Clustured 94812
Unclustured 916104
Percentage clustured 9.38%



1. Nearest cluster
Unclustured 916104
Clusters 4636


100%|██████████| 184/184 [00:51<00:00,  3.55it/s]





2. Create new clusters
Unclustured 916104


100%|██████████| 184/184 [01:06<00:00,  2.75it/s]
100%|██████████| 444/444 [00:00<00:00, 4000.20it/s]





3. Merge new clusters 444


100%|██████████| 1/1 [00:00<00:00, 141.02it/s]
100%|██████████| 432/432 [00:00<00:00, 3903.29it/s]


New merged clusters 432
New clusters with min community size >= 3 432
Total clusters 5068



4. Nearest cluster
Unclustured 914764
Clusters 5068


100%|██████████| 183/183 [00:56<00:00,  3.24it/s]


Clustured 100287
Unclustured 910629
Percentage clustured 9.92%



1. Nearest cluster
Unclustured 910629
Clusters 5068


100%|██████████| 183/183 [00:56<00:00,  3.22it/s]





2. Create new clusters
Unclustured 910629


100%|██████████| 183/183 [01:07<00:00,  2.72it/s]
100%|██████████| 388/388 [00:00<00:00, 3911.23it/s]





3. Merge new clusters 388


100%|██████████| 1/1 [00:00<00:00, 115.27it/s]
100%|██████████| 380/380 [00:00<00:00, 4102.01it/s]


New merged clusters 380
New clusters with min community size >= 3 380
Total clusters 5448



4. Nearest cluster
Unclustured 909461
Clusters 5448


100%|██████████| 182/182 [01:00<00:00,  3.02it/s]


Clustured 104882
Unclustured 906034
Percentage clustured 10.37%



1. Nearest cluster
Unclustured 906034
Clusters 5448


100%|██████████| 182/182 [00:59<00:00,  3.07it/s]





2. Create new clusters
Unclustured 906034


100%|██████████| 182/182 [01:06<00:00,  2.73it/s]
100%|██████████| 365/365 [00:00<00:00, 4075.78it/s]





3. Merge new clusters 365


100%|██████████| 1/1 [00:00<00:00, 130.69it/s]
100%|██████████| 357/357 [00:00<00:00, 3967.85it/s]


New merged clusters 357
New clusters with min community size >= 3 357
Total clusters 5805



4. Nearest cluster
Unclustured 904939
Clusters 5805


100%|██████████| 181/181 [01:03<00:00,  2.84it/s]


Clustured 108876
Unclustured 902040
Percentage clustured 10.77%
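
The percentage reported at the end of each iteration is the number of clustered ids divided by the total number of ids, where an id counts as clustered if it is a member of any cluster or a cluster head. The sketch below reproduces that calculation on a toy `clusters` dictionary of the same shape as the one used in this notebook (head id mapped to a list of `(id, similarity)` pairs). The name `coverage` and the toy values are assumptions made for illustration.

```python
def coverage(all_ids, clusters):
    # Clustered ids: every member of every cluster, plus the cluster heads (dict keys)
    clustered = {member_id for members in clusters.values() for member_id, _ in members}
    clustered |= set(clusters)
    unclustered = set(all_ids) - clustered
    pct = len(clustered) / (len(clustered) + len(unclustered)) * 100
    return len(clustered), len(unclustered), pct


toy_clusters = {
    1: [(1, 1.0), (2, 0.9), (3, 0.8)],
    7: [(7, 1.0), (8, 0.75)],
}
n_clustered, n_unclustered, pct = coverage(list(range(1, 11)), toy_clusters)

# The same formula applied to the counts printed in the final iteration above
final_pct = round(108876 / (108876 + 902040) * 100, 2)
```

With the counts from the last iteration, the formula gives the 10.77% shown in the log.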





In [14]:
for cluster in list(clusters.values())[:25]:
    cluster_ids = [tx[0] for tx in cluster]
    print('\n'.join(data['query'][data.id.isin(cluster_ids)]) + '\n\n')

how long cooked eggs in fridge
how long is an unrefrigerated boiled egg good
how long to boiled eggs keep
how long to cook a soft boiled egg
how long do you have to hard boil an egg
how long does hard boiled eggs stay good
how long can you keep refrigerated hard boiled eggs for?
how many minutes to boil an egg
how long to hard boiled eggs
how long do i boil eggs?
how long are boiled eggs good for in the refrigerator
how long will fresh eggs keep
how long.to.boil eggs
how long do eggs boil for
how long will hard boiled eggs last in refrigerator
how long do eggs last room temperature
how long to hard boil an egg
How long can eggs keep at room temperature
how long to cook hard boiled egg from fridge
how long does it take to boil eggs
how long can you keep a boiled egg
how long do you let eggs boil
how long to hard boil eggs last
how long do hard boiled eggs keep fresh
how long to eggs last in the refrigerator
how long for a hard boiled egg to cook
how long to eggs stay good
how long will 

In [15]:
len(clusters)

5805
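
For downstream use it is often handy to turn the `clusters` dictionary into a flat lookup from each query id to the id of its cluster head. The sketch below shows one way to do that, assuming the structure used throughout this notebook (head id mapped to a list of `(id, similarity)` pairs). The name `cluster_labels` and the toy data are hypothetical.

```python
def cluster_labels(clusters):
    labels = {}
    for head_id, members in clusters.items():
        for member_id, _similarity in members:
            # setdefault keeps the first assignment if an id appears in several clusters
            labels.setdefault(member_id, head_id)
        # The head always belongs to its own cluster unless it was assigned earlier
        labels.setdefault(head_id, head_id)
    return labels


toy_clusters = {
    1: [(1, 1.0), (2, 0.9)],
    7: [(7, 1.0), (8, 0.8), (2, 0.75)],
}
labels = cluster_labels(toy_clusters)
```

Applied to the real result, `data['id'].map(cluster_labels(clusters))` would yield a column holding the cluster head for every clustered query and `NaN` for queries that were left unclustered.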