PyTorch implementation of Clustering as Reasoning: A K-Means Interpretation of Chain-of-Thought Graph Learning for ICML 2026.
KCoT performs graph pretraining, generates structural and feature-space prompts, refines node semantics with an LLM, encodes the refined text with a sentence-transformer, and feeds the refined embeddings back into downstream graph learning.
.
|-- dataset
| |-- cora
| | |-- node_info.csv
| | |-- processed_data.pt
| | |-- processed_data_link_notest.pt
| | |-- simteg_sbert_x.pt
| | |-- simteg_roberta_x.pt
| | |-- simteg_e5_x.pt
| | |-- node_summaries.csv
| | |-- prompt
| | `-- 1
| |-- pubmed
| |-- ogbn-arxiv
| `-- ogbn-products
|-- llm
| |-- all-mpnet-base-v2
| `-- vicuna-7b-v1.5-16k
|-- config.py
|-- dataloader.py
|-- gcn.py
|-- main.py
|-- model.py
|-- preprocess.py
|-- use_llm.py
|-- use_llm_API.py
|-- utils.py
|-- requirements.txt
`-- README.md
Large data files, checkpoints, generated CSV files, and local LLM weights are not intended to be committed to Git.
We provide Cora as an example dataset, including several intermediate generated files. You can download it from:
KCoT Cora example data on Google Drive
Place the extracted files under:
dataset/cora/
The only additional text-attributed-graph file used by this project is:
dataset/<dataset_name>/node_info.csv
node_info.csv should contain:
- paper_id: the external node or paper ID used in prompts.
- title: node text title.
- abstract: node text content.
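A quick header check can catch a malformed node_info.csv before a long run. The sketch below is not part of the repository: the helper name missing_columns is hypothetical, and only the three column names come from the description above.

```python
import csv
import io

# Column names taken from the README; the helper itself is illustrative.
REQUIRED_COLUMNS = ("paper_id", "title", "abstract")


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header, in README order."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Demo on an in-memory file with placeholder content.
sample = io.StringIO("paper_id,title,abstract\n0,Example title,Example abstract\n")
print(missing_columns(sample))
```

For a real dataset, pass an open file handle instead, for example open("dataset/cora/node_info.csv", newline="").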
All other graph data and feature files follow the data format used by LLaGA. In particular, KCoT expects PyTorch Geometric processed graph files and SimTeG feature tensors compatible with that format:
processed_data.pt
processed_data_link_notest.pt
simteg_sbert_x.pt
simteg_roberta_x.pt
simteg_e5_x.pt
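To confirm that a dataset directory is laid out as described, a small presence check along these lines may help. It is a sketch only: missing_files is a hypothetical helper, and it assumes every file listed above is needed, which the README does not state for every configuration.

```python
from pathlib import Path

# File names as listed in the README; treating all of them as required is an assumption.
EXPECTED_FILES = (
    "node_info.csv",
    "processed_data.pt",
    "processed_data_link_notest.pt",
    "simteg_sbert_x.pt",
    "simteg_roberta_x.pt",
    "simteg_e5_x.pt",
)


def missing_files(dataset_dir):
    """Return the expected file names that are not present in dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Prints the names still missing under dataset/cora (all of them if the folder is absent).
print(missing_files("dataset/cora"))
```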
KCoT uses two local Hugging Face models by default:
- sentence-transformers/all-mpnet-base-v2 for encoding refined LLM text.
- lmsys/vicuna-7b-v1.5-16k for local LLM inference.
The default local layout is:
llm/all-mpnet-base-v2/
llm/vicuna-7b-v1.5-16k/
You can override these paths in config.py or with environment variables:
export KCOT_EMBED_MODEL_PATH=/path/to/all-mpnet-base-v2
export KCOT_LLM_MODEL_PATH=/path/to/vicuna-7b-v1.5-16k
use_llm.py is the default local Vicuna backend. use_llm_API.py is kept as an optional OpenAI-compatible API backend; it writes _api_llm.csv outputs instead of overwriting the local _local_llm.csv outputs.
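The model-path overrides described above (defaults from the local layout, replaced by environment variables when set) could be resolved roughly as follows. This is a minimal sketch, not the actual contents of config.py: the function resolve_model_paths and the constant names are hypothetical, while the two environment variable names and the default folders come from this README.

```python
import os

# Defaults mirror the local layout shown in the README.
DEFAULT_EMBED_MODEL_PATH = "llm/all-mpnet-base-v2"
DEFAULT_LLM_MODEL_PATH = "llm/vicuna-7b-v1.5-16k"


def resolve_model_paths(env=None):
    """Pick each model path from the environment, falling back to the default."""
    env = os.environ if env is None else env
    return {
        "embed": env.get("KCOT_EMBED_MODEL_PATH", DEFAULT_EMBED_MODEL_PATH),
        "llm": env.get("KCOT_LLM_MODEL_PATH", DEFAULT_LLM_MODEL_PATH),
    }


# Passing a plain dict makes the behaviour easy to try without touching the shell.
print(resolve_model_paths({"KCOT_LLM_MODEL_PATH": "/path/to/vicuna-7b-v1.5-16k"}))
```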
Install PyTorch and PyTorch Geometric according to your CUDA version first. For example, with CUDA 11.8 and PyTorch 2.1:
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install pyg-lib torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install -r requirements.txt
If your CUDA or PyTorch version is different, adjust the PyG wheel URL accordingly.
The main entry point is:
python main.py
By default, the project runs node classification on Cora. The dataset, task, paths, model settings, and hyperparameters are configured in config.py.
Run node classification:
export KCOT_TASK=nc
python main.py
Run link prediction:
export KCOT_TASK=lp
python main.py
During training, KCoT creates intermediate files under dataset/<dataset_name>/.
Prompt files:
prompt/<dataset>_structural_prompts.csv
prompt/<dataset>_fusion_knn_prompts.csv
prompt/<dataset>_prompts_thought_<thought>.csv
LLM outputs:
node_summaries.csv
<epoch>/<dataset>_refined_text_structural_local_llm.csv
<epoch>/<dataset>_refined_text_fusion_knn_local_llm.csv
Refined embedding files:
<epoch>/<thought>_thought_embeddings.pt
<epoch>/<dataset>_refined_structural_emb.pt
<epoch>/<dataset>_refined_fusion_knn_emb.pt
Long LLM jobs are resumable. Incomplete CSV files use the .partial suffix and are finalized automatically when all expected paper_id rows are present.
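The finalization rule above can be illustrated with a short sketch. It is not the repository's implementation: finalize_if_complete is a hypothetical helper, the demo file name and the refined_text column are made up, and it assumes the suffix is appended after .csv (for example name.csv.partial).

```python
import csv
import os
import tempfile
from pathlib import Path


def finalize_if_complete(partial_path, expected_ids):
    """Rename name.csv.partial to name.csv once every expected paper_id has a row.

    Returns the final path on success, or None if rows are still missing.
    """
    partial = Path(partial_path)
    with partial.open(newline="") as handle:
        done = {row["paper_id"] for row in csv.DictReader(handle)}
    if set(expected_ids) <= done:
        final = partial.with_suffix("")  # drops only the trailing ".partial"
        os.replace(partial, final)
        return final
    return None


# Demo in a temporary directory with placeholder rows.
with tempfile.TemporaryDirectory() as tmp:
    partial = Path(tmp) / "demo_local_llm.csv.partial"
    partial.write_text("paper_id,refined_text\n1,a\n2,b\n")
    print(finalize_if_complete(partial, ["1", "2", "3"]))  # row 3 missing: stays partial
    with partial.open("a", newline="") as handle:
        csv.writer(handle).writerow(["3", "c"])
    print(finalize_if_complete(partial, ["1", "2", "3"]))  # complete: renamed
```

Reading the IDs as strings avoids mismatches between integer and text IDs; a resumed job would use the same set difference to decide which nodes still need an LLM call.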
Edit config.py to change:
- dataset name and task (nc or lp);
- dataset, checkpoint, and local LLM paths;
- GCN dimensions and training epochs;
- downstream learning rate and weight decay;
- number of KCoT thoughts;
- local Vicuna generation parameters;
- optional API backend settings.
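As an illustration of how the task setting might be read and validated, here is a minimal sketch. RunConfig and load_run_config are hypothetical names and do not describe the real config.py; only the KCOT_TASK variable, the nc and lp values, and the Cora node-classification default come from this README.

```python
import os
from dataclasses import dataclass

VALID_TASKS = ("nc", "lp")  # node classification, link prediction


@dataclass(frozen=True)
class RunConfig:
    dataset: str = "cora"  # README default dataset
    task: str = "nc"       # README default task


def load_run_config(env=None):
    """Build a RunConfig, letting KCOT_TASK override the default task."""
    env = os.environ if env is None else env
    task = env.get("KCOT_TASK", "nc")
    if task not in VALID_TASKS:
        raise ValueError(f"KCOT_TASK must be one of {VALID_TASKS}, got {task!r}")
    return RunConfig(task=task)


print(load_run_config({"KCOT_TASK": "lp"}))
```

Failing fast on an unknown task value keeps a typo in the environment from silently falling back to node classification.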