# Large-scale agglomerative clustering on millions of sentences

This notebook provides the code for the article "Clustering millions of sentences to optimise the ML-workflow". It shows the implementation of the scalable sentence clustering algorithm. The article's example clusters 1 million Bing queries from the MS MARCO dataset; in this copy that loading code is commented out and the run uses a smaller Excel dataset instead.


# Setup

In [2]:
%%capture
!pip install sentence_transformers funcy pickle5 openpyxl

In [3]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import math


In [4]:
data = pd.read_excel(
    'Cohesity Dataset to be worked upon.xlsx',
    usecols=['Session Identifier', 'Activity Detail'],
    sheet_name="Case Created",
).rename(columns={'Session Identifier': 'id', 'Activity Detail': 'query'})
data.head()

Unnamed: 0,id,query
0,1657928862586145,Remove SQL server from backup | 00848226
1,1657924160996123,No errors SQL Database protection group not ge...
2,1657917016006422,Sensor reading critical | 00848124
3,1657916928916517,Helios scheduled reports not being received co...
4,1657917583286324,"CE00101009, CE00101115,CE00102203,CE00113014 ..."


# Embedding code

In [5]:
def embed_data(data, key='text', model_name='all-MiniLM-L6-v2', cores=1, gpu=False, batch_size=128):
    """
    Embed the sentences/text using the MiniLM language model (which uses mean pooling)
    """
    print('Embedding data')
    model = SentenceTransformer(model_name)
    print('Model loaded')

    sentences = data[key].tolist()
    unique_sentences = data[key].unique()
    print('Unique sentences', len(unique_sentences))

    if cores == 1:
        embeddings = model.encode(unique_sentences, show_progress_bar=True, batch_size=batch_size)
    else:
        devices = ['cpu'] * cores
        if gpu:
            devices = None  # use all CUDA devices

        # Start the multi-process pool on multiple devices
        print('Multi-process pool starting')
        pool = model.start_multi_process_pool(devices)
        print('Multi-process pool started')

        chunk_size = math.ceil(len(unique_sentences) / cores)

        # Compute the embeddings using the multi-process pool
        embeddings = model.encode_multi_process(unique_sentences, pool, batch_size=batch_size, chunk_size=chunk_size)
        model.stop_multi_process_pool(pool)

    print("Embeddings computed")

    mapping = {sentence: embedding for sentence, embedding in zip(unique_sentences, embeddings)}
    embeddings = np.array([mapping[sentence] for sentence in sentences])
  
    return embeddings
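
The last two statements of `embed_data` encode each distinct sentence once and then map the vectors back onto every row. The sketch below shows that step in isolation. It is not the notebook's model call: `fake_encode` and `embed_with_dedup` are hypothetical names, and the stand-in encoder returns a made-up two-number vector per string so the example runs without downloading a model.

```python
def fake_encode(sentences):
    # Hypothetical stand-in for model.encode: one small vector per sentence.
    return [[float(len(s)), float(s.count(" "))] for s in sentences]


def embed_with_dedup(sentences, encode=fake_encode):
    # Encode each distinct sentence once, preserving first-seen order.
    unique_sentences = list(dict.fromkeys(sentences))
    vectors = encode(unique_sentences)
    # Map every sentence back to its vector so the output aligns with the input rows.
    mapping = dict(zip(unique_sentences, vectors))
    return [mapping[s] for s in sentences], len(unique_sentences)


rows = ["disk full", "backup failed", "disk full", "disk full"]
vectors, n_encoded = embed_with_dedup(rows)
print(n_encoded)     # number of sentences actually sent to the encoder
print(len(vectors))  # one vector per input row
```

Repeated sentences share one vector, so the encoder cost depends on the number of distinct sentences rather than on the number of rows.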

# Clustering code


In [6]:
import logging
import math
import os
from collections import defaultdict

import numpy as np
import torch
from funcy import log_durations
from joblib import Parallel, delayed
from torch import Tensor
from tqdm import tqdm

try:
    import pickle5 as pickle  # backport of pickle protocol 5 for Python < 3.8
except ImportError:
    import pickle  # the standard library supports protocol 5 from Python 3.8


def cos_sim(a: Tensor, b: Tensor):
    """
    Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j.
    :return: Matrix with res[i][j]  = cos_sim(a[i], b[j])
    """
    if not isinstance(a, torch.Tensor):
        a = torch.tensor(np.array(a))

    if not isinstance(b, torch.Tensor):
        b = torch.tensor(np.array(b))

    if len(a.shape) == 1:
        a = a.unsqueeze(0)

    if len(b.shape) == 1:
        b = b.unsqueeze(0)

    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))


def get_embeddings(ids, embeddings):
    return np.array([embeddings[idx] for idx in ids])


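def reorder_and_filter_cluster(
    cluster_idx, cluster, cluster_embeddings, cluster_head_embedding, threshold
):
    # Similarity of every cluster member to the cluster head, highest first
    cos_scores = cos_sim(cluster_head_embedding, cluster_embeddings)
    sorted_vals, indices = torch.sort(cos_scores[0], descending=True)
    bigger_than_threshold = sorted_vals > threshold
    # indices[k] is the position in `cluster` of the k-th highest score, so the
    # score must be read from sorted_vals[k], not from sorted_vals[indices[k]]
    kept_indices = indices[bigger_than_threshold].tolist()
    kept_vals = sorted_vals[bigger_than_threshold].tolist()
    return cluster_idx, [(cluster[i][0], val) for i, val in zip(kept_indices, kept_vals)]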


def get_ids(cluster):
    return [transaction[0] for transaction in cluster]


def reorder_and_filter_clusters(clusters, embeddings, threshold, parallel):
    results = parallel(
        delayed(reorder_and_filter_cluster)(
            cluster_idx,
            cluster,
            get_embeddings(get_ids(cluster), embeddings),
            get_embeddings([cluster_idx], embeddings),
            threshold,
        )
        for cluster_idx, cluster in tqdm(clusters.items())
    )

    clusters = {k: v for k, v in results}

    return clusters


def get_clustured_ids(clusters):
    clustered_ids = set(
        [transaction[0] for cluster in clusters.values() for transaction in cluster]
    )
    clustered_ids |= set(clusters.keys())
    return clustered_ids


def get_clusters_ids(clusters):
    return list(clusters.keys())


def get_unclustured_ids(ids, clusters):
    clustered_ids = get_clustured_ids(clusters)
    unclustered_ids = list(set(ids) - clustered_ids)
    return unclustered_ids


def sort_clusters(clusters):
    return dict(
        sorted(clusters.items(), key=lambda x: len(x[1]), reverse=True)
    )  # sort based on size


def sort_cluster(cluster):
    return list(
        sorted(cluster, key=lambda x: x[1], reverse=True)
    )  # sort based on similarity


def filter_clusters(clusters, min_cluster_size):
    return {k: v for k, v in clusters.items() if len(v) >= min_cluster_size}


def unique(collection):
    return list(dict.fromkeys(collection))


def unique_txs(collection):
    seen = set()
    return [x for x in collection if not (x[0] in seen or seen.add(x[0]))]


def write_pickle(data, path):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


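def chunk(txs, chunk_size):
    """Split txs into ceil(len(txs) / chunk_size) chunks of near-equal size."""
    if len(txs) == 0:
        return iter(())  # nothing to split; avoids dividing by zero below
    n = math.ceil(len(txs) / chunk_size)
    k, m = divmod(len(txs), n)
    return (txs[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)] for i in range(n))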



def online_community_detection(
    ids,
    embeddings,
    clusters=None,
    threshold=0.7,
    min_cluster_size=3,
    chunk_size=2500,
    iterations=10,
    cores=1,
):
    if clusters is None:
        clusters = {}

    with Parallel(n_jobs=cores) as parallel:
        for iteration in range(iterations):
            print("1. Nearest cluster")
            unclustered_ids = get_unclustured_ids(ids, clusters)
            cluster_ids = list(clusters.keys())
            print("Unclustured", len(unclustered_ids))
            print("Clusters", len(cluster_ids))
            clusters = nearest_cluster(
                unclustered_ids,
                embeddings,
                clusters,
                chunk_size=chunk_size,
                parallel=parallel,
            )
            print("\n\n")

            print("2. Create new clusters")
            unclustered_ids = get_unclustured_ids(ids, clusters)
            print("Unclustured", len(unclustered_ids))
            new_clusters = create_clusters(
                unclustered_ids,
                embeddings,
                clusters={},
                min_cluster_size=min_cluster_size,
                chunk_size=chunk_size,
                threshold=threshold,
                parallel=parallel,
            )
            new_cluster_ids = list(new_clusters.keys())
            print("\n\n")

            print("3. Merge new clusters", len(new_cluster_ids))
            max_clusters_size = 25000
            while True:
                new_cluster_ids = list(new_clusters.keys())
                old_new_cluster_ids = new_cluster_ids
                new_clusters = create_clusters(
                    new_cluster_ids,
                    embeddings,
                    new_clusters,
                    min_cluster_size=1,
                    chunk_size=max_clusters_size,
                    threshold=threshold,
                    parallel=parallel,
                )
                new_clusters = filter_clusters(new_clusters, 2)

                new_cluster_ids = list(new_clusters.keys())
                print("New merged clusters", len(new_cluster_ids))
                if len(old_new_cluster_ids) < max_clusters_size:
                    break

            new_clusters = filter_clusters(new_clusters, min_cluster_size)
            print(
                f"New clusters with min community size >= {min_cluster_size}",
                len(new_clusters),
            )
            clusters = {**new_clusters, **clusters}
            print("Total clusters", len(clusters))
            clusters = sort_clusters(clusters)
            print("\n\n")

            print("4. Nearest cluster")
            unclustered_ids = get_unclustured_ids(ids, clusters)
            cluster_ids = list(clusters.keys())
            print("Unclustured", len(unclustered_ids))
            print("Clusters", len(cluster_ids))
            clusters = nearest_cluster(
                unclustered_ids,
                embeddings,
                clusters,
                chunk_size=chunk_size,
                parallel=parallel,
            )
            clusters = sort_clusters(clusters)

            unclustered_ids = get_unclustured_ids(ids, clusters)
            clustured_ids = get_clustured_ids(clusters)
            print("Clustured", len(clustured_ids))
            print("Unclustured", len(unclustered_ids))
            print(
                f"Percentage clustured {len(clustured_ids) / (len(clustured_ids) + len(unclustered_ids)) * 100:.2f}%"
            )

            print("\n\n")
    return clusters


def nearest_cluster_chunk(
    chunk_ids, chunk_embeddings, cluster_ids, cluster_embeddings, threshold
):
    cos_scores = cos_sim(chunk_embeddings, cluster_embeddings)
    top_val_large, top_idx_large = cos_scores.topk(k=1, largest=True)
    top_idx_large = top_idx_large[:, 0].tolist()
    top_val_large = top_val_large[:, 0].tolist()
    cluster_assignment = []
    for i, (score, idx) in enumerate(zip(top_val_large, top_idx_large)):
        cluster_id = cluster_ids[idx]
        if score < threshold:
            cluster_id = None
        cluster_assignment.append(((chunk_ids[i], score), cluster_id))
    return cluster_assignment


def nearest_cluster(
    transaction_ids,
    embeddings,
    clusters=None,
    parallel=None,
    threshold=0.75,
    chunk_size=2500,
):
    if clusters is None:
        clusters = {}
    cluster_ids = list(clusters.keys())
    if len(cluster_ids) == 0:
        return clusters
    cluster_embeddings = get_embeddings(cluster_ids, embeddings)

    c = list(chunk(transaction_ids, chunk_size))

    with log_durations(logging.info, "Parallel jobs nearest cluster"):
        out = parallel(
            delayed(nearest_cluster_chunk)(
                chunk_ids,
                get_embeddings(chunk_ids, embeddings),
                cluster_ids,
                cluster_embeddings,
                threshold,
            )
            for chunk_ids in tqdm(c)
        )
        cluster_assignment = [assignment for sublist in out for assignment in sublist]

    for (transaction_id, similarity), cluster_id in cluster_assignment:
        if cluster_id is None:
            continue
        clusters[cluster_id].append(
            (transaction_id, similarity)
        )  # TODO sort in right order

    clusters = {
        cluster_id: unique_txs(sort_cluster(cluster))
        for cluster_id, cluster in clusters.items()
    }  # Sort based on similarity

    return clusters


def create_clusters(
    ids,
    embeddings,
    clusters=None,
    parallel=None,
    min_cluster_size=3,
    threshold=0.75,
    chunk_size=2500,
):
    if clusters is None:
        clusters = {}  # the default would otherwise fail at clusters.get below
    to_cluster_ids = np.array(ids)
    np.random.shuffle(
        to_cluster_ids
    )  # TODO evaluate performance without, try sorted list

    c = list(chunk(to_cluster_ids, chunk_size))

    with log_durations(logging.info, "Parallel jobs create clusters"):
        out = parallel(
            delayed(fast_clustering)(
                chunk_ids,
                get_embeddings(chunk_ids, embeddings),
                threshold,
                min_cluster_size,
            )
            for chunk_ids in tqdm(c)
        )

    # Combine output
    new_clusters = {}
    for out_clusters in out:
        for idx, cluster in out_clusters.items():
            # new_clusters[idx] = unique([(idx, 1)] + new_clusters.get(idx, []) + cluster)
            new_clusters[idx] = unique_txs(cluster + new_clusters.get(idx, []))

    # Add ids from old cluster to new cluster
    for cluster_idx, cluster in new_clusters.items():
        community_extended = []
        for (idx, similarity) in cluster:
            community_extended += [(idx, similarity)] + clusters.get(idx, [])
        new_clusters[cluster_idx] = unique_txs(community_extended)

    new_clusters = reorder_and_filter_clusters(
        new_clusters, embeddings, threshold, parallel
    )  # filter to keep only the relevant
    new_clusters = sort_clusters(new_clusters)

    clustered_ids = set()
    for idx, cluster_ids in new_clusters.items():
        filtered = set(cluster_ids) - clustered_ids
        cluster_ids = [
            cluster_idx for cluster_idx in cluster_ids if cluster_idx in filtered
        ]
        new_clusters[idx] = cluster_ids
        clustered_ids |= set(cluster_ids)

    new_clusters = filter_clusters(new_clusters, min_cluster_size)
    new_clusters = sort_clusters(new_clusters)
    return new_clusters


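def fast_clustering(ids, embeddings, threshold=0.70, min_cluster_size=10):
    """
    Fast clustering of one chunk.

    Finds all communities in the embeddings, i.e. groups of embeddings whose
    cosine similarity to a candidate head is at least `threshold`, and returns
    the non-overlapping ones that have at least `min_cluster_size` members.
    """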

    # Compute cosine similarity scores
    cos_scores = cos_sim(embeddings, embeddings)

    # Step 1) Create clusters where similarity is bigger than threshold
    bigger_than_threshold = cos_scores >= threshold
    indices = bigger_than_threshold.nonzero()

    cos_scores = cos_scores.numpy()

    extracted_clusters = defaultdict(lambda: [])
    for row, col in indices.tolist():
        extracted_clusters[ids[row]].append((ids[col], cos_scores[row, col]))

    extracted_clusters = sort_clusters(extracted_clusters)  # FIXME

    # Step 2) Remove overlapping clusters
    unique_clusters = {}
    extracted_ids = set()

    for cluster_id, cluster in extracted_clusters.items():
        add_cluster = True
        for transaction in cluster:
            if transaction[0] in extracted_ids:
                add_cluster = False
                break

        if add_cluster:
            unique_clusters[cluster_id] = cluster
            for transaction in cluster:
                extracted_ids.add(transaction[0])

    new_clusters = {}
    for cluster_id, cluster in unique_clusters.items():
        community_extended = []
        for idx in cluster:
            community_extended.append(idx)
        new_clusters[cluster_id] = unique_txs(community_extended)

    new_clusters = filter_clusters(new_clusters, min_cluster_size)

    return new_clusters
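
The two steps inside `fast_clustering` (collect every candidate community above the threshold, then keep the largest ones that do not overlap) can be followed on a handful of points. The sketch below is a dependency-free re-statement for illustration only, not the notebook's torch implementation; `toy_fast_clustering`, the ids and the two-dimensional vectors are all made up.

```python
import math


def cosine(u, v):
    # Cosine similarity of two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def toy_fast_clustering(ids, vectors, threshold=0.7, min_cluster_size=2):
    # Step 1: every item proposes a community of all items at or above the threshold.
    candidates = {}
    for i, head in enumerate(ids):
        members = []
        for j, other in enumerate(ids):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                members.append((other, score))
        candidates[head] = members

    # Step 2: visit the largest communities first and keep only those that
    # share no member with a community that was already accepted.
    accepted, taken = {}, set()
    by_size = sorted(candidates.items(), key=lambda kv: len(kv[1]), reverse=True)
    for head, members in by_size:
        if any(member_id in taken for member_id, _ in members):
            continue
        accepted[head] = members
        taken.update(member_id for member_id, _ in members)

    return {h: m for h, m in accepted.items() if len(m) >= min_cluster_size}


ids = ["a", "b", "c", "d", "e"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]]
result = toy_fast_clustering(ids, vectors)
for head, members in result.items():
    print(head, [member_id for member_id, _ in members])
```

Here `a`, `b` and `c` point in nearly the same direction and end up in one community headed by `a`, while `d` and `e` form a second one. The candidates proposed by `b`, `c` and `e` are dropped because their members are already taken.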


# Run

In [None]:
# train = pd.read_csv('./queries.train.tsv', sep='\t', names=['id', 'query'])
# dev = pd.read_csv('./queries.dev.tsv', sep='\t', names=['id', 'query'])
# eval = pd.read_csv('./queries.eval.tsv', sep='\t', names=['id', 'query'])
# data = pd.concat([train, dev, eval])

In [7]:
ids = data.id

In [8]:
embeddings = embed_data(data, 'query', cores=1)
embeddings = {idx: embedding for idx, embedding in zip(ids, embeddings)}

Embedding data


Model loaded
Unique sentences 3717


Embeddings computed
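
The log above reports 3717 unique sentences, while the clustering run further down starts from 3511 ids, so some `id` values occur on more than one row. Because the embeddings are stored in a dict keyed by `id`, a later row with the same id replaces an earlier one. A minimal sketch of that effect with made-up ids and vectors (not taken from the dataset), followed by one hypothetical way to keep every row by keying on row position instead:

```python
ids = [101, 102, 101, 103]              # made-up ids; 101 appears twice
vectors = [[0.1], [0.2], [0.3], [0.4]]  # made-up one-dimensional "embeddings"

# Keyed by id, as in the cell above: the second row for 101 replaces the first.
by_id = {idx: vec for idx, vec in zip(ids, vectors)}
print(len(ids), len(by_id))

# Hypothetical alternative: key by row position so that no row is dropped.
by_row = dict(enumerate(vectors))
print(len(by_row))
```

Whether collapsing rows by id is acceptable depends on what the id stands for; if each row should be clustered separately, a unique key per row is needed.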


In [9]:
clusters = {}

In [10]:
clusters = online_community_detection(ids, embeddings, clusters, chunk_size=1500)

1. Nearest cluster
Unclustured 3511
Clusters 0



2. Create new clusters
Unclustured 3511


100%|██████████| 3/3 [00:00<00:00, 21.35it/s]
100%|██████████| 180/180 [00:00<00:00, 3518.89it/s]





3. Merge new clusters 180


100%|██████████| 1/1 [00:00<00:00, 131.81it/s]
100%|██████████| 106/106 [00:00<00:00, 3338.91it/s]


New merged clusters 106
New clusters with min community size >= 3 106
Total clusters 106



4. Nearest cluster
Unclustured 2732
Clusters 106


100%|██████████| 2/2 [00:00<00:00, 114.13it/s]


Clustured 861
Unclustured 2650
Percentage clustured 24.52%



1. Nearest cluster
Unclustured 2650
Clusters 106


100%|██████████| 2/2 [00:00<00:00, 161.50it/s]





2. Create new clusters
Unclustured 2650


100%|██████████| 2/2 [00:00<00:00, 34.53it/s]
100%|██████████| 100/100 [00:00<00:00, 3467.60it/s]





3. Merge new clusters 100


100%|██████████| 1/1 [00:00<00:00, 525.80it/s]
100%|██████████| 87/87 [00:00<00:00, 2519.73it/s]


New merged clusters 87
New clusters with min community size >= 3 87
Total clusters 193



4. Nearest cluster
Unclustured 2297
Clusters 193


100%|██████████| 2/2 [00:00<00:00, 121.21it/s]


Clustured 1242
Unclustured 2269
Percentage clustured 35.37%



1. Nearest cluster
Unclustured 2269
Clusters 193


100%|██████████| 2/2 [00:00<00:00, 101.66it/s]





2. Create new clusters
Unclustured 2269


100%|██████████| 2/2 [00:00<00:00, 38.51it/s]
100%|██████████| 27/27 [00:00<00:00, 2852.05it/s]





3. Merge new clusters 27


100%|██████████| 1/1 [00:00<00:00, 184.11it/s]
100%|██████████| 27/27 [00:00<00:00, 2751.17it/s]


New merged clusters 27
New clusters with min community size >= 3 27
Total clusters 220



4. Nearest cluster
Unclustured 2185
Clusters 220


100%|██████████| 2/2 [00:00<00:00, 140.29it/s]


Clustured 1333
Unclustured 2178
Percentage clustured 37.97%



1. Nearest cluster
Unclustured 2178
Clusters 220


100%|██████████| 2/2 [00:00<00:00, 156.57it/s]





2. Create new clusters
Unclustured 2178


100%|██████████| 2/2 [00:00<00:00, 51.70it/s]
100%|██████████| 12/12 [00:00<00:00, 2312.08it/s]





3. Merge new clusters 12


100%|██████████| 1/1 [00:00<00:00, 1201.46it/s]
100%|██████████| 12/12 [00:00<00:00, 1114.86it/s]


New merged clusters 12
New clusters with min community size >= 3 12
Total clusters 232



4. Nearest cluster
Unclustured 2141
Clusters 232


100%|██████████| 2/2 [00:00<00:00, 18.78it/s]


Clustured 1370
Unclustured 2141
Percentage clustured 39.02%



1. Nearest cluster
Unclustured 2141
Clusters 232


100%|██████████| 2/2 [00:00<00:00, 132.00it/s]





2. Create new clusters
Unclustured 2141


100%|██████████| 2/2 [00:00<00:00, 43.75it/s]
100%|██████████| 6/6 [00:00<00:00, 1035.42it/s]





3. Merge new clusters 6


100%|██████████| 1/1 [00:00<00:00, 1673.70it/s]
100%|██████████| 6/6 [00:00<00:00, 571.76it/s]


New merged clusters 6
New clusters with min community size >= 3 6
Total clusters 238



4. Nearest cluster
Unclustured 2123
Clusters 238


100%|██████████| 2/2 [00:00<00:00, 123.80it/s]


Clustured 1388
Unclustured 2123
Percentage clustured 39.53%



1. Nearest cluster
Unclustured 2123
Clusters 238


100%|██████████| 2/2 [00:00<00:00, 128.29it/s]





2. Create new clusters
Unclustured 2123


100%|██████████| 2/2 [00:00<00:00, 55.65it/s]
100%|██████████| 5/5 [00:00<00:00, 822.90it/s]





3. Merge new clusters 5


100%|██████████| 1/1 [00:00<00:00, 1811.01it/s]
100%|██████████| 5/5 [00:00<00:00, 2163.35it/s]


New merged clusters 5
New clusters with min community size >= 3 5
Total clusters 243



4. Nearest cluster
Unclustured 2108
Clusters 243


100%|██████████| 2/2 [00:00<00:00, 64.69it/s]


Clustured 1403
Unclustured 2108
Percentage clustured 39.96%



1. Nearest cluster
Unclustured 2108
Clusters 243


100%|██████████| 2/2 [00:00<00:00, 119.98it/s]





2. Create new clusters
Unclustured 2108


100%|██████████| 2/2 [00:00<00:00, 61.79it/s]
100%|██████████| 5/5 [00:00<00:00, 1529.76it/s]





3. Merge new clusters 5


100%|██████████| 1/1 [00:00<00:00, 269.80it/s]
100%|██████████| 5/5 [00:00<00:00, 757.94it/s]


New merged clusters 5
New clusters with min community size >= 3 5
Total clusters 248



4. Nearest cluster
Unclustured 2093
Clusters 248


100%|██████████| 2/2 [00:00<00:00, 106.22it/s]


Clustured 1418
Unclustured 2093
Percentage clustured 40.39%



1. Nearest cluster
Unclustured 2093
Clusters 248


100%|██████████| 2/2 [00:00<00:00, 119.55it/s]





2. Create new clusters
Unclustured 2093


100%|██████████| 2/2 [00:00<00:00, 53.33it/s]
100%|██████████| 5/5 [00:00<00:00, 1675.58it/s]





3. Merge new clusters 5


100%|██████████| 1/1 [00:00<00:00, 206.98it/s]
100%|██████████| 5/5 [00:00<00:00, 940.76it/s]


New merged clusters 5
New clusters with min community size >= 3 5
Total clusters 253



4. Nearest cluster
Unclustured 2078
Clusters 253


100%|██████████| 2/2 [00:00<00:00, 112.82it/s]


Clustured 1433
Unclustured 2078
Percentage clustured 40.81%



1. Nearest cluster
Unclustured 2078
Clusters 253


100%|██████████| 2/2 [00:00<00:00, 163.43it/s]





2. Create new clusters
Unclustured 2078


100%|██████████| 2/2 [00:00<00:00, 51.43it/s]
100%|██████████| 6/6 [00:00<00:00, 627.50it/s]





3. Merge new clusters 6


100%|██████████| 1/1 [00:00<00:00, 1738.21it/s]
100%|██████████| 6/6 [00:00<00:00, 2647.92it/s]


New merged clusters 6
New clusters with min community size >= 3 6
Total clusters 259



4. Nearest cluster
Unclustured 2060
Clusters 259


100%|██████████| 2/2 [00:00<00:00, 92.30it/s]


Clustured 1451
Unclustured 2060
Percentage clustured 41.33%



1. Nearest cluster
Unclustured 2060
Clusters 259


100%|██████████| 2/2 [00:00<00:00, 81.61it/s]





2. Create new clusters
Unclustured 2060


100%|██████████| 2/2 [00:00<00:00, 60.81it/s]
100%|██████████| 3/3 [00:00<00:00, 2623.08it/s]





3. Merge new clusters 3


100%|██████████| 1/1 [00:00<00:00, 889.00it/s]
100%|██████████| 3/3 [00:00<00:00, 1412.54it/s]


New merged clusters 3
New clusters with min community size >= 3 3
Total clusters 262



4. Nearest cluster
Unclustured 2051
Clusters 262


100%|██████████| 2/2 [00:00<00:00, 98.47it/s]


Clustured 1460
Unclustured 2051
Percentage clustured 41.58%
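
The first progress bar under "2. Create new clusters" in the first iteration counts 3 chunks. That number follows from how `chunk` splits the 3511 ids with `chunk_size=1500`. The helper's arithmetic is reproduced below so it can be checked on its own.

```python
import math


def chunk(txs, chunk_size):
    # Same arithmetic as the notebook's helper: n near-equal parts.
    n = math.ceil(len(txs) / chunk_size)
    k, m = divmod(len(txs), n)
    return (txs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))


sizes = [len(part) for part in chunk(list(range(3511)), 1500)]
print(sizes)  # [1171, 1170, 1170]
```

The parts are balanced rather than cut at 1500, 1500 and 511, so `chunk_size` acts as an upper bound on the chunk length, not as the exact length.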





In [11]:
# Write one row per cluster: the member queries joined by newlines
cluster_texts = []
for cluster in clusters.values():
    member_ids = [tx[0] for tx in cluster]
    cluster_texts.append('\n'.join(data['query'][data.id.isin(member_ids)]) + '\n\n')
pd.DataFrame(cluster_texts).to_excel("new_hcp.xlsx")

In [12]:
len(clusters)

262
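
With the 262 cluster keys available, assigning one more query reduces to the comparison that `nearest_cluster_chunk` performs: find the most similar cluster head and accept it only if the score reaches the threshold. The sketch below is a dependency-free illustration with made-up two-dimensional vectors; `assign_to_nearest` and the head names are hypothetical and not part of the notebook.

```python
import math


def cosine(u, v):
    # Cosine similarity of two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def assign_to_nearest(vector, head_vectors, threshold=0.75):
    # Return (head_id, score) for the best head, or (None, score) if below threshold.
    best_id, best_score = max(
        ((head_id, cosine(vector, head_vec)) for head_id, head_vec in head_vectors.items()),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


heads = {"sql_backup": [1.0, 0.0], "sensor_alert": [0.0, 1.0]}
print(assign_to_nearest([0.9, 0.2], heads)[0])  # close to the first head
print(assign_to_nearest([0.7, 0.7], heads)[0])  # halfway between both heads
```

The second query scores about 0.71 against both heads, which is below the 0.75 default, so it stays unassigned. This mirrors the way ids remain in the "Unclustured" count of the log.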

In [13]:
df1 = pd.read_excel('new_hcp.xlsx').rename(columns = {'Unnamed: 0':'Cluster',0 : 'query'}).reset_index(drop=True)
df1.head()

Unnamed: 0,Cluster,query
0,0,Database back failing Oracle backup problems |...
1,1,undefined | 00847916\nundefined | 00846491\nun...
2,2,Add additional Node to AWS cluster | 00846583\...
3,3,CE03008019 Yoda agent unhealthy | 00848023\nAl...
4,4,Need assistance upgrading the Cohesity cluster...
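
For downstream use it is often handier to have one label per row than one row per cluster. Assuming `clusters` keeps the shape produced above (a dict from head id to a list of `(id, similarity)` pairs), a lookup table can be built as sketched below. The ids and scores are made up, and `label_rows` is a hypothetical helper.

```python
def label_rows(clusters):
    # Invert {head_id: [(member_id, similarity), ...]} into {member_id: head_id}.
    labels = {}
    for head_id, members in clusters.items():
        for member_id, _similarity in members:
            # Keep the first assignment if an id shows up in more than one cluster.
            labels.setdefault(member_id, head_id)
    return labels


toy_clusters = {
    10: [(10, 1.0), (11, 0.91), (12, 0.84)],
    20: [(20, 1.0), (21, 0.88), (11, 0.76)],
}
labels = label_rows(toy_clusters)
print(labels)
```

With pandas, `data['id'].map(labels)` would then give the cluster key per row and leave unclustered rows as `NaN`.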
