UMAP: https://arxiv.org/abs/1802.03426

coUMAP: file:///Users/dshiebler/Downloads/Topological%20Methods%20for%20Unsupervised%20Learning.pdf

HDBSCAN: https://arxiv.org/pdf/1705.07321.pdf

HDBSCAN Review: https://www.overleaf.com/project/5e96fc13d759bd000197b47d

Available Gensim Embeddings: https://github.com/RaRe-Technologies/gensim-data#datasets

**Question**: What is the purpose of the switch from the Rips complex to the Lesnick complex?

**Answer**: Restricting to the Lesnick complex keeps only the edges between "core points," i.e. points that have at least $k$ neighbors in their $\epsilon$-neighborhood. This reduces the number of edges in the graph.
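
To make the core-point restriction concrete, here is a minimal NumPy sketch. The function name `core_point_edges` and the toy points are hypothetical, and the convention that a point does not count as its own neighbor is an assumption; the references above may count it differently.

```python
import numpy as np

def core_point_edges(points, eps, k):
    """Return (core_mask, edges): edges join pairs of core points within eps.

    Assumption: a point is "core" if at least k OTHER points lie within eps.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Subtract 1 so the point itself is not counted as a neighbor.
    neighbor_counts = (dists <= eps).sum(axis=1) - 1
    core_mask = neighbor_counts >= k
    n = len(points)
    edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dists[i, j] <= eps and core_mask[i] and core_mask[j]
    ]
    return core_mask, edges

# Three tightly packed points and one far-away point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
core_mask, edges = core_point_edges(points, eps=0.5, k=2)
print(core_mask)  # the isolated point is not a core point
print(edges)      # only edges among the three core points remain
```

With `k=0` every point is a core point and the edge set is the full $\epsilon$-neighborhood graph, so raising `k` can only remove edges.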


**Question**: In co-UMAP, we "stitch" together the fuzzy simplicial complexes formed from the rescaled metric spaces into one metric space. Why don't we need to do this in HDBSCAN? Do we actually need to do this in co-UMAP?

**Answer**: We don't need to do this in HDBSCAN because we can just form a fancy locally-reweighted metric space by rescaling distances locally, so the combining happens at the metric space level rather than at the simplicial set level. I think we can do the same thing in UMAP, in which case we don't need to think about stitching things together in simplicial set space. I also think that the simplicial set intersection is equivalent to building the rescaled metric space first and then taking its simplicial set.
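
The local rescaling in HDBSCAN is the mutual reachability distance, $d_{\text{mreach}}(a, b) = \max(\text{core}_k(a), \text{core}_k(b), d(a, b))$. Below is a small NumPy sketch of it; the function name `mutual_reachability` and the toy data are hypothetical, and taking the core distance as the distance to the $k$-th nearest other point is an assumed convention (implementations differ on whether the point counts itself).

```python
import numpy as np

def mutual_reachability(points, k):
    """Return (core_distances, mutual_reachability_matrix) for Euclidean points."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself (distance 0),
    # so column k is the distance to the k-th nearest other point.
    core = np.sort(dists, axis=1)[:, k]
    mreach = np.maximum(dists, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return core, mreach

points = np.array([[0.0], [1.0], [3.0], [10.0]])
core, mreach = mutual_reachability(points, k=2)
print(core)          # [3. 2. 3. 9.]
print(mreach[0, 1])  # 3.0: the raw distance of 1.0 is inflated by the core distances
```

The whole rescaling produces a single distance matrix, which is why nothing needs to be stitched together afterwards.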



**Question**: Where is the varying $\epsilon$ coming in? How is this actually being used right now?

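**Answer**: This comes in when we construct the single linkage tree from the minimum spanning tree: each edge weight in the minimum spanning tree is the value of $\epsilon$ at which two components merge.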
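
As a rough illustration of that step, the sketch below builds a minimum spanning tree with Prim's algorithm and then turns it into a single linkage tree by sorting the edges and merging components with union-find. The function names are hypothetical and this is not the code path of the `hdbscan` library, only a small dense-matrix version of the same idea. The output rows have the layout (left, right, distance, size).

```python
import numpy as np

def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns (i, j, weight) tuples."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()            # cheapest known edge from the tree to each point
    parent = np.zeros(n, dtype=int)  # tree endpoint of that cheapest edge
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        update = (dist[j] < best) & ~in_tree
        best[update] = dist[j][update]
        parent[update] = j
    return edges

def single_linkage_from_mst(edges, n):
    """Merge components in order of increasing edge weight (the growing epsilon)."""
    parent = list(range(2 * n - 1))
    size = [1] * n + [0] * (n - 1)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    rows = []
    next_label = n
    for i, j, w in sorted(edges, key=lambda e: e[2]):
        a, b = find(i), find(j)
        size[next_label] = size[a] + size[b]
        parent[a] = parent[b] = next_label
        rows.append((a, b, w, size[next_label]))
        next_label += 1
    return np.array(rows, dtype=float)

points = np.array([0.0, 1.0, 3.0, 10.0])
dist = np.abs(points[:, None] - points[None, :])
tree = single_linkage_from_mst(mst_edges(dist), n=len(points))
print(tree)
# [[0. 1. 1. 2.]
#  [4. 2. 2. 3.]
#  [5. 3. 7. 4.]]
```

Reading the third column top to bottom gives the increasing sequence of $\epsilon$ values at which components merge.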



**Question**: Where do `min_cluster_size` and `min_samples` come in?

**Answer**: `min_samples` controls the distance metric itself, since this is the parameter that determines which nearest neighbor sets each point's core distance. `min_cluster_size` comes in at the final step, when we condense the single linkage tree into a flat clustering.
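
A simplified sketch of where `min_cluster_size` acts: when walking the single linkage tree, a merge only counts as a genuine cluster split if both children contain at least `min_cluster_size` points; otherwise the smaller side is treated as points falling out of the larger cluster. The helper `count_true_splits` and the hand-written linkage array are hypothetical, and the sketch leaves out the stability-based selection that produces the final flat labels.

```python
import numpy as np

def count_true_splits(linkage, min_cluster_size):
    """Count merges whose two children both have at least min_cluster_size points.

    linkage rows are (left, right, distance, size); ids below n are single points.
    """
    n = linkage.shape[0] + 1

    def subtree_size(node):
        node = int(node)
        return 1 if node < n else int(linkage[node - n, 3])

    splits = 0
    for left, right, _distance, _size in linkage:
        if (subtree_size(left) >= min_cluster_size
                and subtree_size(right) >= min_cluster_size):
            splits += 1
    return splits

# Hand-written tree for 6 points: two tight groups of 3 joined at distance 5.0.
linkage = np.array([
    [0, 1, 1.0, 2],   # -> node 6
    [3, 4, 1.0, 2],   # -> node 7
    [7, 5, 1.1, 3],   # -> node 8
    [6, 2, 1.2, 3],   # -> node 9
    [8, 9, 5.0, 6],   # -> node 10
], dtype=float)

for m in (1, 2, 4):
    print(m, count_true_splits(linkage, min_cluster_size=m))
```

Larger values of `min_cluster_size` leave fewer genuine splits, so the condensed tree has fewer candidate clusters.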



In [1]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
from tqdm import tqdm
import annoy
from annoy import AnnoyIndex
import random
import numpy as np
import pandas as pd
import seaborn as sns
import time
from sklearn.cluster import KMeans
from collections import defaultdict


sns.reset_defaults()
sns.set_context(context='talk',font_scale=0.7)

%matplotlib inline
%load_ext autoreload

%autoreload 2

# model = api.load("glove-twitter-25")
# model = api.load("glove-wiki-gigaword-50")
# model = api.load("glove-wiki-gigaword-300")

# vocab_path = "/Users/dshiebler/workspace/data/glove-wiki-gigaword-50/vocab_words.npy"
# embedding_path = "/Users/dshiebler/workspace/data/glove-wiki-gigaword-50/embeddings.npy"
# np.save(vocab_path, vocab_words)
# np.save(embedding_path, embeddings)



unable to import 'smart_open.gcs', disabling that module


In [2]:
# entity_ids = np.load("/Users/dshiebler/workspace/data/tagspace_entity_ids.npy")
# embeddings = np.load("/Users/dshiebler/workspace/data/tagspace_embeddings.npy")

In [5]:
vocab_path = "/Users/dshiebler/workspace/data/glove-wiki-gigaword-50/vocab_words.npy"
embedding_path = "/Users/dshiebler/workspace/data/glove-wiki-gigaword-50/embeddings.npy"

entity_ids = np.load(vocab_path)
embeddings = np.load(embedding_path)
word_to_index = {k: i for i, k in enumerate(entity_ids)}


In [6]:

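class KeyTree(object):
    """Approximate nearest-neighbor index (Annoy, angular distance) that returns keys rather than row indices."""

    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.tree = AnnoyIndex(values.shape[1], 'angular')
        # Pass total so tqdm can report progress against the number of rows.
        for i, v in tqdm(enumerate(values), total=len(values)):
            self.tree.add_item(i, v)
        print("building tree...")
        self.tree.build(10)  # build a forest of 10 trees
        print("tree built!")

    def get_nns_by_vector(self, vector, num_neighbors):
        # Map Annoy's integer item ids back to the caller's keys.
        indices = self.tree.get_nns_by_vector(vector, num_neighbors)
        return [self.keys[i] for i in indices]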

def get_label_to_words(labels, entity_ids):
    label_to_words = defaultdict(list)
    for label, entity_id in zip(labels, entity_ids):
        label_to_words[label].append(entity_id)
    return label_to_words

def optionally_convert_to_numeric(a):
    try:
        return int(a)
    except Exception:
        return a

In [7]:
# tree = KeyTree(keys=entity_ids, values=embeddings)
# index = word_to_index["obama"]
# print(entity_ids[index])
# print(tree.tree.get_nns_by_vector(
#     vector=embeddings[index], n=10, include_distances=True))

In [8]:
from hdbscan import HDBSCAN

offset = 100
num_words = 10000
cut_words = entity_ids[offset:offset+num_words]
cut_embeddings = embeddings[offset:offset+num_words]


start = time.time()
hdbscan_clusterer = HDBSCAN(
    algorithm='boruvka_kdtree',
    min_cluster_size=2,
    core_dist_n_jobs=1, prediction_data=True)


hdbscan_clusterer.fit(cut_embeddings)
coverage = sum(
    hdbscan_clusterer.labels_ > -1) / len(hdbscan_clusterer.labels_)
print("Coverage: {}".format(coverage))
hdbscan_label_to_words = get_label_to_words(
    labels=hdbscan_clusterer.labels_, entity_ids=cut_words)
for l in hdbscan_label_to_words:
    print([optionally_convert_to_numeric(a) for a in hdbscan_label_to_words[l][:100]])
    print("-----------------")


__init__ called (1)
Checking algorithm
calling _hdbscan_boruvka_kdtree
building AnnoyTree
AnnoyTree built. building KDTreeBoruvkaAlgorithm
building KDTreeBoruvkaAlgorithm
running _initialize_components
running _compute_bounds
computed core distances after running knn. First 20 elements of self.core_distance_arr: [1.58064318 2.50566006 1.45178533 2.73400903 3.41486526 2.35332799
 1.90886295 2.42778444 2.351583   2.33710718 2.52732444 2.52514935
 3.69890213 2.10792232 1.92786276 3.06855226 2.45455551 2.5707438
 3.25401258 2.19480538]
point 15 is getting left out with core_distance 9.416012945534987
point 32 is getting left out with core_distance 8.35439247639465
point 43 is getting left out with core_distance 3.224433862471315
point 45 is getting left out with core_distance 10.66460561961867
point 51 is getting left out with core_distance 2.3144286436947823
point 55 is getting left out with core_distance 4.500674305099949
point 59 is getting left out with core_distance 2.996802835838267



node 616 node_info {'idx_start': 2050, 'idx_end': 2070, 'is_leaf': 1, 'radius': 7.56286412247305}
point 7754 in node 616 is in component 4481 rather than current_component 2556

node 615 node_info {'idx_start': 2031, 'idx_end': 2050, 'is_leaf': 1, 'radius': 8.154091061624017}
point 3120 in node 615 is in component 6396 rather than current_component 3963

node 614 node_info {'idx_start': 2011, 'idx_end': 2031, 'is_leaf': 1, 'radius': 6.635696502963793}
point 3836 in node 614 is in component 2115 rather than current_component 30

node 613 node_info {'idx_start': 1992, 'idx_end': 2011, 'is_leaf': 1, 'radius': 6.356839054151014}
point 1444 in node 613 is in component 2268 rather than current_component 105

node 612 node_info {'idx_start': 1972, 'idx_end': 1992, 'is_leaf': 1, 'radius': 6.636791568580931}
point 2740 in node 612 is in component 1405 rather than current_component 1933

node 611 node_info {'idx_start': 1953, 'idx_end': 1972, 'is_leaf': 1, 'radius': 6.57586746687341}
point 9976

Found point 3385 to be the new nearest neighbor to component 345 with distance 8.119584693283798
Found point 2954 to be the new nearest neighbor to component 2978 with distance 5.852334569616687
Found point 5723 to be the new nearest neighbor to component 7578 with distance 14.665973779197069
Found point 673 to be the new nearest neighbor to component 79 with distance 4.4081959027216175
Found point 673 to be the new nearest neighbor to component 30 with distance 6.109605212274337
Found point 8765 to be the new nearest neighbor to component 1393 with distance 12.71106797339945
Found point 6539 to be the new nearest neighbor to component 5611 with distance 18.554115715829766
Found point 1109 to be the new nearest neighbor to component 1056 with distance 8.470999435996646
Found point 4321 to be the new nearest neighbor to component 2706 with distance 21.823208211513702
Found point 3980 to be the new nearest neighbor to component 1844 with distance 7.727504666679664
Found point 1763 to be 

Found point 2819 to be the new nearest neighbor to component 2137 with distance 14.662559588031879
Found point 1342 to be the new nearest neighbor to component 4742 with distance 27.488552033349887
Found point 5685 to be the new nearest neighbor to component 4742 with distance 23.907567297868148
Found point 2077 to be the new nearest neighbor to component 883 with distance 5.774837634785906
Found point 5417 to be the new nearest neighbor to component 2054 with distance 13.855076777353727
Found point 3924 to be the new nearest neighbor to component 3290 with distance 12.018660543870844
Found point 5971 to be the new nearest neighbor to component 2003 with distance 7.63019476039976
Found point 740 to be the new nearest neighbor to component 1510 with distance 9.917886879304433
Found point 5935 to be the new nearest neighbor to component 4150 with distance 16.495054657303854
Found point 4385 to be the new nearest neighbor to component 1767 with distance 5.954638151026529
Found point 6730 

Found point 7681 to be the new nearest neighbor to component 4011 with distance 11.998540882366624
Found point 4462 to be the new nearest neighbor to component 1778 with distance 15.699825561492219
Found point 2260 to be the new nearest neighbor to component 3848 with distance 11.203317944763569
Found point 5501 to be the new nearest neighbor to component 6459 with distance 16.528254369698548
Found point 4666 to be the new nearest neighbor to component 6459 with distance 15.686154451736474
Found point 7648 to be the new nearest neighbor to component 9366 with distance 18.169996314468545
Found point 6135 to be the new nearest neighbor to component 1844 with distance 5.1978475923434075
Found point 7832 to be the new nearest neighbor to component 1949 with distance 11.92903417503067
Found point 6519 to be the new nearest neighbor to component 215 with distance 14.509259915262335
Found point 4787 to be the new nearest neighbor to component 1635 with distance 9.967541618573659
Found point 6

Found point 4931 to be the new nearest neighbor to component 4627 with distance 12.974531903071753
Found point 2596 to be the new nearest neighbor to component 4297 with distance 16.388888404029107
Found point 5368 to be the new nearest neighbor to component 4297 with distance 14.246059988146701
Found point 8219 to be the new nearest neighbor to component 729 with distance 12.789777139853863
Found point 6442 to be the new nearest neighbor to component 3813 with distance 17.48385613799556
Found point 1422 to be the new nearest neighbor to component 582 with distance 6.880626397150259
Found point 9101 to be the new nearest neighbor to component 2324 with distance 11.21048648872543
Found point 4971 to be the new nearest neighbor to component 6239 with distance 17.64608595903666
Found point 6869 to be the new nearest neighbor to component 7095 with distance 11.812738166845747
Found point 7455 to be the new nearest neighbor to component 5688 with distance 10.321451135139874
Found point 4579

Found point 9318 to be the new nearest neighbor to component 6019 with distance 9.741335721420304
Found point 6447 to be the new nearest neighbor to component 4165 with distance 6.555125674721239
Found point 2051 to be the new nearest neighbor to component 950 with distance 9.346843663492757
Found point 2590 to be the new nearest neighbor to component 3281 with distance 9.51153908025355
Found point 2088 to be the new nearest neighbor to component 2251 with distance 8.071261336183408
Found point 594 to be the new nearest neighbor to component 731 with distance 9.890455020151649
Found point 2179 to be the new nearest neighbor to component 1099 with distance 5.233951722684878
Found point 4138 to be the new nearest neighbor to component 1168 with distance 10.850628606567625
Found point 4845 to be the new nearest neighbor to component 4011 with distance 9.104032446448912
Found point 2088 to be the new nearest neighbor to component 1406 with distance 5.493605518269362
Found point 4883 to be 

Found point 8147 to be the new nearest neighbor to component 4986 with distance 8.950073094657778
Found point 1549 to be the new nearest neighbor to component 6575 with distance 9.36527301885166
Found point 8712 to be the new nearest neighbor to component 660 with distance 10.049124240491375
Found point 1878 to be the new nearest neighbor to component 6397 with distance 9.362261311880353
Found point 8501 to be the new nearest neighbor to component 3181 with distance 9.517017819483101
Found point 733 to be the new nearest neighbor to component 416 with distance 6.83062495038535
Found point 404 to be the new nearest neighbor to component 3947 with distance 6.936019731541876
Found point 4568 to be the new nearest neighbor to component 3589 with distance 9.583242033158058
Found point 5225 to be the new nearest neighbor to component 8026 with distance 11.58752562907126
Found point 4727 to be the new nearest neighbor to component 3379 with distance 16.149739888737713
Found point 5799 to be t

Found point 5013 to be the new nearest neighbor to component 1924 with distance 14.91261442753779
Found point 1954 to be the new nearest neighbor to component 4900 with distance 12.491882020248813
Found point 5843 to be the new nearest neighbor to component 4900 with distance 11.274233996644666
Found point 3141 to be the new nearest neighbor to component 4900 with distance 10.816089399407304
Found point 6022 to be the new nearest neighbor to component 4900 with distance 10.470211151796939
Found point 6626 to be the new nearest neighbor to component 3777 with distance 10.106216825885296
Found point 6384 to be the new nearest neighbor to component 3328 with distance 5.0366283639687595
Found point 6882 to be the new nearest neighbor to component 1193 with distance 9.489031438268976
Found point 8409 to be the new nearest neighbor to component 8728 with distance 12.774554574669786
Found point 4283 to be the new nearest neighbor to component 619 with distance 13.884346892090434
Found point 8

            
traversing self.tree

node 1022 node_info {'idx_start': 9980, 'idx_end': 10000, 'is_leaf': 1, 'radius': 7.647163068970979}
point 2566 in node 1022 is in component 1099 rather than current_component 541

node 1021 node_info {'idx_start': 9960, 'idx_end': 9980, 'is_leaf': 1, 'radius': 7.5581299185718915}
point 4576 in node 1021 is in component 772 rather than current_component 6482

node 1020 node_info {'idx_start': 9940, 'idx_end': 9960, 'is_leaf': 1, 'radius': 6.980157344467381}
point 5535 in node 1020 is in component 6538 rather than current_component 6144

node 1019 node_info {'idx_start': 9921, 'idx_end': 9940, 'is_leaf': 1, 'radius': 8.13210839600272}
point 7675 in node 1019 is in component 5501 rather than current_component 5107

node 1018 node_info {'idx_start': 9901, 'idx_end': 9921, 'is_leaf': 1, 'radius': 6.411610878170073}
point 1610 in node 1018 is in component 1103 rather than current_component 1099

node 1017 node_info {'idx_start': 9882, 'idx_end': 9901, 'is_

Found point 2005 to be the new nearest neighbor to component 270 with distance 5.193236522797788
Found point 1234 to be the new nearest neighbor to component 4284 with distance 33.81050346526115
Found point 3898 to be the new nearest neighbor to component 4284 with distance 29.59322945888331
Found point 4425 to be the new nearest neighbor to component 4284 with distance 26.377894870344505
Found point 7245 to be the new nearest neighbor to component 4284 with distance 25.55384173202713
Found point 1234 to be the new nearest neighbor to component 277 with distance 37.66709638906769
Found point 6553 to be the new nearest neighbor to component 277 with distance 36.78837510307947
Found point 3898 to be the new nearest neighbor to component 277 with distance 34.60755837984896
Found point 1234 to be the new nearest neighbor to component 1538 with distance 21.577543760484843
Found point 4198 to be the new nearest neighbor to component 1538 with distance 15.044209033266302
Found point 4425 to b

Found point 2771 to be the new nearest neighbor to component 336 with distance 9.41417322792209
Found point 2771 to be the new nearest neighbor to component 336 with distance 9.33361102556779
Found point 51 to be the new nearest neighbor to component 270 with distance 3.473915927594362
Found point 1468 to be the new nearest neighbor to component 1099 with distance 5.442186730361257
Found point 2211 to be the new nearest neighbor to component 935 with distance 12.572893271439114
Found point 1386 to be the new nearest neighbor to component 5501 with distance 10.96776344982722
Found point 4104 to be the new nearest neighbor to component 4173 with distance 10.988492123229804
Found point 2402 to be the new nearest neighbor to component 502 with distance 9.352427571338978
Found point 8540 to be the new nearest neighbor to component 1718 with distance 14.362081979273867
Found point 1386 to be the new nearest neighbor to component 5501 with distance 9.864754996846957
Found point 8858 to be the

Found point 5000 to be the new nearest neighbor to component 1836 with distance 12.001584619253617
Found point 4139 to be the new nearest neighbor to component 1836 with distance 11.991136540251713
Found point 6329 to be the new nearest neighbor to component 1836 with distance 11.906954757186277
Found point 7458 to be the new nearest neighbor to component 1836 with distance 11.718702375810032
Found point 8449 to be the new nearest neighbor to component 1836 with distance 10.520105595905871
Found point 6184 to be the new nearest neighbor to component 1836 with distance 10.520051772132675
Found point 7159 to be the new nearest neighbor to component 1836 with distance 10.41704137575196
Found point 3638 to be the new nearest neighbor to component 1836 with distance 9.220098997994285
Found point 3638 to be the new nearest neighbor to component 7009 with distance 9.51728222155079
Found point 7534 to be the new nearest neighbor to component 4113 with distance 10.892201135130104
Found point 55

Found point 4936 to be the new nearest neighbor to component 1487 with distance 6.397526525061695
Found point 5350 to be the new nearest neighbor to component 4037 with distance 10.701033445565058
Found point 4315 to be the new nearest neighbor to component 4333 with distance 10.853949830885334
Found point 1727 to be the new nearest neighbor to component 4033 with distance 8.348436027660622
Found point 2294 to be the new nearest neighbor to component 4033 with distance 8.246052003906652
Found point 2390 to be the new nearest neighbor to component 2457 with distance 10.874849784248703
Found point 1340 to be the new nearest neighbor to component 4153 with distance 8.97674219721224
Found point 2268 to be the new nearest neighbor to component 805 with distance 9.3555980775605
Found point 4666 to be the new nearest neighbor to component 5107 with distance 9.936818673768869
Found point 5816 to be the new nearest neighbor to component 3631 with distance 12.235309477835074

            self.up

Node 274 has child1 in component -550 and child2 in component -551

node 273 node_info {'idx_start': 703, 'idx_end': 742, 'is_leaf': 0, 'radius': 8.494779484983606}
Node 273 has child1 in component -548 and child2 in component -549

node 272 node_info {'idx_start': 664, 'idx_end': 703, 'is_leaf': 0, 'radius': 8.758017888835639}
Node 272 has child1 in component -546 and child2 in component -547

node 271 node_info {'idx_start': 625, 'idx_end': 664, 'is_leaf': 0, 'radius': 8.250607551739499}
Node 271 has child1 in component -544 and child2 in component -545

node 270 node_info {'idx_start': 585, 'idx_end': 625, 'is_leaf': 0, 'radius': 8.004924863669041}
Node 270 has child1 in component -542 and child2 in component 559

node 269 node_info {'idx_start': 546, 'idx_end': 585, 'is_leaf': 0, 'radius': 8.191520606618742}
Node 269 has child1 in component -540 and child2 in component -541

node 268 node_info {'idx_start': 507, 'idx_end': 546, 'is_leaf': 0, 'radius': 7.2019405040031055}
Node 268 h

Found point 5047 to be the new nearest neighbor to component 2593 with distance 12.534109838552762
Found point 7579 to be the new nearest neighbor to component 5107 with distance 10.044808722097468
Found point 5478 to be the new nearest neighbor to component 1538 with distance 10.925282124038935
Found point 678 to be the new nearest neighbor to component 559 with distance 6.2716849985673395
Found point 307 to be the new nearest neighbor to component 1578 with distance 4.052298295418693
Found point 3894 to be the new nearest neighbor to component 556 with distance 10.013034177406155
Found point 846 to be the new nearest neighbor to component 559 with distance 5.891784176907427
Found point 2207 to be the new nearest neighbor to component 2593 with distance 11.876815633809734
Found point 639 to be the new nearest neighbor to component 2593 with distance 10.16717867595115
Found point 302 to be the new nearest neighbor to component 2593 with distance 9.75312516615398
Found point 302 to be t

point 568 in node 703 is in component 595 rather than current_component 559

node 702 node_info {'idx_start': 3730, 'idx_end': 3750, 'is_leaf': 1, 'radius': 7.454057489784155}
point 6970 in node 702 is in component 595 rather than current_component 559

node 701 node_info {'idx_start': 3710, 'idx_end': 3730, 'is_leaf': 1, 'radius': 8.206612106356562}
point 8514 in node 701 is in component 595 rather than current_component 559

node 700 node_info {'idx_start': 3690, 'idx_end': 3710, 'is_leaf': 1, 'radius': 8.114682627357752}
point 7218 in node 700 is in component 559 rather than current_component 595

node 699 node_info {'idx_start': 3671, 'idx_end': 3690, 'is_leaf': 1, 'radius': 8.561733384360526}
point 4196 in node 699 is in component 559 rather than current_component 595

node 698 node_info {'idx_start': 3651, 'idx_end': 3671, 'is_leaf': 1, 'radius': 7.86953008641455}
point 4813 in node 698 is in component 559 rather than current_component 595

node 697 node_info {'idx_start': 3632, 

Found point 1968 to be the new nearest neighbor to component 559 with distance 7.256876895584945
Found point 1002 to be the new nearest neighbor to component 4153 with distance 10.122333457607581
Found point 186 to be the new nearest neighbor to component 595 with distance 9.458338517666945
Found point 376 to be the new nearest neighbor to component 595 with distance 9.07672092267262

            self.update_components() called with:
                self.candidate_point_arr[:20]: [  0   1   2   2 503   5   6   7   8 972  10  11  12  13  14  -1  50  17
  18  19]
                self.candidate_neighbor_arr[:20]: [  51   59  638  545  537    8  288  112 1552 1098  933   14  395  314
   49   -1   43   51   53  232]
                self.candidate_distance_arr[:20]: [1.79769313e+308 1.79769313e+308 1.79769313e+308 1.79769313e+308
 1.79769313e+308 1.79769313e+308 1.79769313e+308 1.79769313e+308
 1.79769313e+308 1.79769313e+308 1.79769313e+308 1.79769313e+308
 1.79769313e+308 1.79769313e+308 1

node 1007 node_info {'idx_start': 9687, 'idx_end': 9706, 'is_leaf': 1, 'radius': 7.11163334156171}
Assigning component_of_node[1007] to 595

node 1006 node_info {'idx_start': 9667, 'idx_end': 9687, 'is_leaf': 1, 'radius': 7.4127405063081335}
Assigning component_of_node[1006] to 595

node 1005 node_info {'idx_start': 9648, 'idx_end': 9667, 'is_leaf': 1, 'radius': 7.865136144013082}
Assigning component_of_node[1005] to 595

node 1004 node_info {'idx_start': 9628, 'idx_end': 9648, 'is_leaf': 1, 'radius': 6.531784649917556}
Assigning component_of_node[1004] to 595

node 1003 node_info {'idx_start': 9609, 'idx_end': 9628, 'is_leaf': 1, 'radius': 7.187530726925463}
Assigning component_of_node[1003] to 595

node 1002 node_info {'idx_start': 9589, 'idx_end': 9609, 'is_leaf': 1, 'radius': 11.092337478957647}
Assigning component_of_node[1002] to 595

node 1001 node_info {'idx_start': 9570, 'idx_end': 9589, 'is_leaf': 1, 'radius': 6.574716927970526}
Assigning component_of_node[1001] to 595

node 

single_linkage_tree [[8.30000000e+01 7.10000000e+01 4.01343256e-01 2.00000000e+00]
 [8.40000000e+01 1.00000000e+04 4.01343256e-01 3.00000000e+00]
 [7.70000000e+01 1.00010000e+04 4.41447079e-01 4.00000000e+00]
 ...
 [4.33500000e+03 1.99950000e+04 9.95923233e+00 9.99800000e+03]
 [9.51300000e+03 1.99960000e+04 1.14168892e+01 9.99900000e+03]
 [8.78100000e+03 1.99970000e+04 1.39045029e+01 1.00000000e+04]]
min_spanning_tree.shape (9999, 3) single_linkage_tree.shape (9999, 4)
Completed in 11.835022926330566 seconds
Coverage: 0.2517
['so', 'still', 'even', 'too', 'though', 'having', 'yet', 'thought', 'fact', 'probably', 'reason', 'seems', 'actually', 'perhaps', 'clearly', 'indeed', 'nevertheless', 'hardly', 'likewise']
-----------------
['them', 'him', 'united', 'during', 'may', 'while', 'where', 'states', 'now', 'city', 'made', 'like', 'between', 'did', 'just', 'national', 'day', 'country', 'under', 'group', 'any', 'through', 'being', 'down', 'back', 'off', 'american', 'minister', 'police', '

In [35]:
from hdbscan.prediction import approximate_predict

preds = approximate_predict(hdbscan_clusterer, embeddings)

In [42]:
hdbscan_label_to_words = get_label_to_words(
    labels=preds[0], entity_ids=entity_ids)
for l in hdbscan_label_to_words:
    print(hdbscan_label_to_words[l][:100])
    print("-----------------")

['the', 'of', 'to', 'in', 'a', '"', "'s", '-', 'on', 'is', 'was', 'said', 'he', 'by', 'at', '(', ')', 'from', 'his', "''", '``', 'an', 'has', 'were', 'who', 'they', 'had', 'i', 'will', 'their', ':', 'or', 'its', 'after', 'new', 'been', "'", 'first', 'about', 'up', 'year', 'there', 'all', '--', 'she', 'people', 'her', 'percent', 'than', 'over', 'into', 'government', 'time', '$', 'you', 'years', 'no', 'world', 'can', ';', 'president', 'state', 'million', 'us', '_', 'against', 'u.s.', 'them', 'him', 'united', 'during', 'may', 'since', 'where', 'states', 'city', 'made', 'like', 'between', 'national', 'day', 'country', 'under', 'such', 'second', 'then', 'company', 'group', 'any', 'through', 'china', 'being', 'down', 'war', 'back', 'off', 'american', 'minister', 'police', 'including']
-----------------
[',', '.', 'and', 'for', 'that', 'with', 'as', 'it', 'be', 'are', 'have', 'but', 'not', 'this', 'which', 'one', 'also', 'we', 'would', 'more', 'when', 'out', 'other', "n't", 'some', 'if', 'do'

In [48]:
sum(preds[1] > 0), len(preds[1])

(6005, 400000)

In [50]:
max_ind = 40000
kmeans_clusterer = KMeans(n_clusters=len(hdbscan_label_to_words), verbose=False)
kmeans_clusterer.fit(embeddings[:max_ind])
label_to_words = get_label_to_words(labels=kmeans_clusterer.labels_, entity_ids=entity_ids[:max_ind])
for l in label_to_words:
    print(label_to_words[l][:100])
    print("-----------------")

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', "'s", 'for', '-', 'that', 'on', 'was', 'with', 'he', 'as', 'it', 'by', 'at', 'from', 'his', 'an', 'be', 'has', 'have', 'but', 'were', 'this', 'who', 'they', 'had', 'which', 'their', 'its', 'one', 'after', 'new', 'been', 'also', 'two', 'more', 'first', 'about', 'up', 'when', 'year', 'there', 'all', '--', 'out', 'other', 'people', 'than', 'over', 'into', 'last', 'some', 'time', 'years', 'world', 'three', ';', 'only', 'most', '_', 'against', 'during', 'before', 'may', 'since', 'many', 'while', 'where', 'now', 'made', 'between', 'day', 'under', 'then', 'through', 'four', 'being', 'down', 'war', 'back', 'off', 'well', 'still', 'both', 'high', 'part', 'those', 'end', 'work', 'home', 'house', 'later', 'another', 'long']
-----------------
['"', "''", '``', 'not', 'i', ':', 'we', "'", 'she', "n't", 'you', 'if', 'no', 'can', 'do', 'so', 'them', 'what', 'him', 'because', 'like', 'did', 'just', 'even', 'our', 'get', 'way', 'much', '?', 'very', 'my', 