# Exploring HN Comment Clusters

This notebook shows how to load a Hacker News comment tree, navigate it by path, and inspect the KMeans and Louvain clusters computed for a comment.

In [1]:
from comment_tree import CommentTree
from IPython.display import Image, display

# Load existing tree
tree = await CommentTree.loadFileOrFetch()
# Get root node
root = await tree.get_comment_by_path('')
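# A bare expression in the middle of a cell is not displayed, so print it
print(f"Number of raw comments: {len(root.raw_comments)}")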
# Get first child's clusters
first_child = await tree.get_comment_by_path('0')
print(f"\nFirst comment: {first_child.text}")
print(f"Number of KMeans clusters: {len(first_child.kmeans_clusters)}")
print(f"Number of Louvain clusters: {len(first_child.louvain_clusters)}")

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of BertForMaskedLM were not initialized from the model checkpoint at microsoft/MiniLM-L12-H384-uncased and are newly initialized: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.de

Processing KMeans clusters for 42763095
Processed 1 KMeans clusters
Processing Louvain clusters for 42763095
Processed 1 Louvain clusters




Fetched comment 42763321 {'by': 'spencerflem', 'id': 42763321, 'kids': [42763677, 42764574, 42765126], 'parent': 42763095, 'text': 'This is part of why I&#x27;ve been so excited about Genode&#x2F;Sculpt <a href="https:&#x2F;&#x2F;genode.org&#x2F;documentation&#x2F;articles&#x2F;sculpt-24-10" rel="nofollow">https:&#x2F;&#x2F;genode.org&#x2F;documentation&#x2F;articles&#x2F;sculpt-24-10</a><p>It&#x27;s tiny, clearly built with love for the user, doesn&#x27;t do a heck of a lot, and has some interesting ideas that are just fun to mess around in. And unlike some of the similar retrocomputing OS&#x27;s (which are also lovely but grounded in old fashioned design), genode feels like a glimpse into the good future.', 'time': 1737330141, 'type': 'comment'}
Fetched comment 42763677 {'by': 'abrookewood', 'id': 42763677, 'kids': [42763796], 'parent': 42763321, 'text': 'That looks like the most radical&#x2F;unusual operating system thing I have seen in recent memory. Not sure how practical it is, b
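
The `text` fields in the fetched comments above arrive as HTML with escaped entities (`&#x27;`, `&#x2F;`, `&quot;`) and `<p>` paragraph markers. A minimal sketch of turning such a string into readable plain text with the standard library follows. The helper name `comment_to_plain_text` is hypothetical and is not part of `comment_tree`.

```python
import html
import re


def comment_to_plain_text(raw: str) -> str:
    """Convert an HTML comment body into plain text (illustrative helper)."""
    # Treat <p> as a paragraph break.
    text = re.sub(r"<p>", "\n\n", raw)
    # Drop any remaining tags such as <a href=...> and </a>.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities last, so an escaped "&lt;" is never mistaken for a tag.
    return html.unescape(text)


sample = "There&#x27;s a good quote:<p>&quot;things that increase usage.&quot;"
print(comment_to_plain_text(sample))
```

In the notebook this could be applied to `first_child.text` before printing, assuming that attribute holds the same raw HTML shown in the fetch log.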

## Navigate the Tree
You can navigate to specific comments using paths: the empty string `''` returns the root, `'0'` its first child, and `'1'` its second child.
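
To illustrate the idea of index-based paths, here is a self-contained toy sketch. It is not the `CommentTree` implementation. The `Node` class, the `get_by_path` function, and the `.` separator for deeper paths are all assumptions made for this example. The notebook itself only demonstrates the paths `''`, `'0'`, and `'1'`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy comment node (hypothetical, for illustration only)."""
    text: str
    children: list["Node"] = field(default_factory=list)


def get_by_path(root: Node, path: str, sep: str = ".") -> Node:
    """Walk child indices from the root; '' returns the root itself.

    The separator is an assumption; the real path syntax may differ.
    """
    node = root
    if path == "":
        return node
    for part in path.split(sep):
        node = node.children[int(part)]
    return node


toy = Node("root", [Node("first", [Node("nested reply")]), Node("second")])
print(get_by_path(toy, "").text)     # root
print(get_by_path(toy, "1").text)    # second
print(get_by_path(toy, "0.0").text)  # nested reply
```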

In [3]:
sec_child = await tree.get_comment_by_path('1')
sec_child.to_dict()


Processing KMeans clusters for 42763095
Processed 1 KMeans clusters
Processing Louvain clusters for 42763095
Processed 1 Louvain clusters




Fetched comment 42763398 {'by': 'xnx', 'id': 42763398, 'parent': 42763095, 'text': 'There&#x27;s plenty of Ed Zitron&#x27;s opinions I don&#x27;t agree with, but this is a really good quote:<p>&quot;Our economy isn’t one that produces things to be used, but things that increase usage.&quot;', 'time': 1737330779, 'type': 'comment'}
Fetched 0 children for 42763398


{'id': '42763398',
 'text': 'There&#x27;s plenty of Ed Zitron&#x27;s opinions I don&#x27;t agree with, but this is a really good quote:<p>&quot;Our economy isn’t one that produces things to be used, but things that increase usage.&quot;',
 'vector': [-0.1188383623957634,
  0.0482986681163311,
  -0.0025480890180915594,
  -0.011011185124516487,
  0.05195079743862152,
  0.010291739366948605,
  0.11543326824903488,
  0.0007007463136687875,
  -0.0859254002571106,
  -0.07065406441688538,
  0.0013317771954461932,
  -0.03547235205769539,
  0.018434101715683937,
  -0.006737193092703819,
  0.024402951821684837,
  -0.029503285884857178,
  -0.058138471096754074,
  -0.05043948069214821,
  -0.020765451714396477,
  0.02903599850833416,
  -0.06367600709199905,
  0.0240299291908741,
  0.026243362575769424,
  -0.006037388928234577,
  -0.011076593771576881,
  -0.0014006991405040026,
  -0.018619751557707787,
  0.03277008235454559,
  0.0028860417660325766,
  -0.05694398283958435,
  -0.04394165799021721,
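
The `vector` field in the dictionary above is the comment's embedding. One common way to compare two embeddings is cosine similarity. The sketch below uses only the standard library. Whether the clustering code in `comment_tree` uses this measure is not shown in this notebook, so treat it as an illustration rather than a description of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In the notebook, the two arguments could be the `'vector'` entries of `to_dict()` for two different comments, assuming every node exposes that key.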
  

## Visualize Clusters

In [None]:
# Generate and display visualizations
kmeans_path = first_child.visualize_clusters('output/example_kmeans', 'kmeans')
louvain_path = first_child.visualize_clusters('output/example_louvain', 'louvain')

print("KMeans Clusters:")
display(Image(filename=kmeans_path))

print("\nLouvain Clusters:")
display(Image(filename=louvain_path))

## Analyze Cluster Properties

In [None]:
# Print the summary, size, and mean confidence of each cluster on a node
def analyze_clusters(node, cluster_type='kmeans'):
    clusters = node.kmeans_clusters if cluster_type == 'kmeans' else node.louvain_clusters
    print(f"\n{cluster_type.upper()} Clusters Analysis:")

    for i, cluster in enumerate(clusters):
        children = cluster.clusterChildren
        print(f"\nCluster {i}:")
        print(f"Summary: {cluster.summaryText}")
        print(f"Size: {len(children)}")
        if children:  # guard against division by zero on an empty cluster
            avg_confidence = sum(c['confidence'] for c in children) / len(children)
            print(f"Average confidence: {avg_confidence:.2f}")
        else:
            print("Average confidence: n/a")

analyze_clusters(first_child, 'kmeans')
analyze_clusters(first_child, 'louvain')