Why it’s Important:
The goal is to define a comprehensive list of topics that can be used across all resources in our infrastructure for consistent topic extraction. By creating a centralized, scalable source of topic definitions, we ensure that Large Language Models (LLMs) have access to a predefined list for accurate extraction and tagging. This enables users to explore specific topics efficiently across a wide range of Bitcoin-related sources.
Current Limitations:
Right now, our method for managing topics relies heavily on fetching topics from Optech, which is only integrated with Bitcoin Transcripts. This approach isn’t scalable across our broader data infrastructure. The current setup involves fetching and merging Optech Topics directly within the Bitcoin Transcripts repository. While useful, it limits broader adoption across our other products.
Here’s an overview of what we currently do:
In the Bitcoin Transcripts registry, we fetch Optech Topics and extract the following information for each topic using the CategoryInfo type:
{
  title: string;
  slug: string;
  optech_url: string;
  categories: string[];
  aliases?: string[];
  excerpt: string;
}
We also have topic tags that aren’t included in the original Optech Topics index, which are currently listed under Misc.
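As a rough illustration of how the `title`, `slug`, and `aliases` fields could support tagging, here is a minimal sketch that maps every title and alias to its topic slug. The helper names (`normalizeLabel`, `buildAliasIndex`) and the sample entry are hypothetical and not taken from the Bitcoin Transcripts code; only the `CategoryInfo` shape comes from the description above.

```typescript
// Shape of a topic entry, as listed in this issue.
type CategoryInfo = {
  title: string;
  slug: string;
  optech_url: string;
  categories: string[];
  aliases?: string[];
  excerpt: string;
};

// Hypothetical helper: normalise a label for case-insensitive matching.
function normalizeLabel(label: string): string {
  return label.trim().toLowerCase().replace(/\s+/g, " ");
}

// Hypothetical helper: map every title and alias to its topic slug.
// The first topic to claim a label keeps it; later collisions are ignored.
function buildAliasIndex(topics: CategoryInfo[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const topic of topics) {
    const labels = [topic.title, ...(topic.aliases ?? [])];
    for (const label of labels) {
      const key = normalizeLabel(label);
      if (key !== "" && !index.has(key)) {
        index.set(key, topic.slug);
      }
    }
  }
  return index;
}

// Illustrative data only, not an entry from the real topics index.
const sampleTopics: CategoryInfo[] = [
  {
    title: "Example Topic",
    slug: "example-topic",
    optech_url: "https://example.invalid/topics/example-topic",
    categories: ["Misc"],
    aliases: ["Sample Topic"],
    excerpt: "Placeholder excerpt.",
  },
];

const aliasIndex = buildAliasIndex(sampleTopics);
console.log(aliasIndex.get(normalizeLabel("  sample   TOPIC "))); // prints "example-topic"
```

A lookup like this would let any consumer resolve a free-form label to a single canonical slug, which is one way the shared list could keep tags consistent across products.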
Additionally, we have a cron job called "Topic Modeling generation" that runs in the llm-fine-tuning repo. This job queries Elasticsearch for documents from specified sources that don’t yet have topic modeling applied. Using GPT and a predefined list of Bitcoin-related topics, it generates primary and secondary topics for each document. However, the list of topics used in this process is now outdated and doesn’t align with our new centralized topic list. We also no longer need both primary and secondary topics; one set of topics will suffice. Moreover, as far as I know, the quality of the tags generated by this cron job has never been formally evaluated, and we have never used them in any product.
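To make the "one set of topics" idea concrete, here is a minimal sketch of collapsing the two lists into one while keeping only slugs that exist in the centralized list. The field names `primary_topics` and `secondary_topics` and the function `collapseTopics` are assumptions for illustration; the actual output format of the cron job is not described here.

```typescript
// Hypothetical shape of the cron job's current output; field names are assumed.
type LegacyTopics = {
  primary_topics: string[];
  secondary_topics: string[];
};

// Merge both lists into one, keep only slugs found in the central list,
// and drop duplicates while preserving first-seen order.
function collapseTopics(legacy: LegacyTopics, allowedSlugs: Set<string>): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const slug of [...legacy.primary_topics, ...legacy.secondary_topics]) {
    if (allowedSlugs.has(slug) && !seen.has(slug)) {
      seen.add(slug);
      result.push(slug);
    }
  }
  return result;
}

// Placeholder slugs, not real topic names.
const allowed = new Set<string>(["topic-a", "topic-b", "topic-c"]);
const merged = collapseTopics(
  { primary_topics: ["topic-a", "outdated-topic"], secondary_topics: ["topic-b", "topic-a"] },
  allowed,
);
console.log(merged); // prints [ 'topic-a', 'topic-b' ]
```

Filtering against the central list means tags produced with the outdated list would simply be dropped; whether they should be remapped instead is a separate decision.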
New Approach:
To build a scalable infrastructure for topic extraction and tagging across all products, we need a centralized repository that defines topic tags and categories. This repository will replace the current Optech-only approach and be used by multiple applications, including Bitcoin Transcripts. Our new topic extraction code will fetch these predefined topics, ensuring consistent application across all resources.
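A consumer of the centralized repository would likely want to check the topic list it has fetched before using it. The sketch below assumes the list is published as a JSON array of `CategoryInfo` objects, which is not decided in this issue; `parseTopics` and the type guards are hypothetical names. It also assumes non-Optech topics keep the same fields, including `optech_url`, which may not hold in the final schema.

```typescript
// Shape of a topic entry, as listed in this issue.
type CategoryInfo = {
  title: string;
  slug: string;
  optech_url: string;
  categories: string[];
  aliases?: string[];
  excerpt: string;
};

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Runtime check that an unknown value has the CategoryInfo fields.
function isCategoryInfo(value: unknown): value is CategoryInfo {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["title"] === "string" &&
    typeof v["slug"] === "string" &&
    typeof v["optech_url"] === "string" &&
    isStringArray(v["categories"]) &&
    (v["aliases"] === undefined || isStringArray(v["aliases"])) &&
    typeof v["excerpt"] === "string"
  );
}

// Hypothetical loader: parse a JSON payload (assumed format) and
// fail loudly on the first malformed entry.
function parseTopics(raw: string): CategoryInfo[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) {
    throw new Error("expected a JSON array of topics");
  }
  const topics: CategoryInfo[] = [];
  data.forEach((entry: unknown, i: number) => {
    if (!isCategoryInfo(entry)) {
      throw new Error(`invalid topic at index ${i}`);
    }
    topics.push(entry);
  });
  return topics;
}
```

Failing on the first malformed entry is a deliberate choice in this sketch: a silently truncated topic list would produce inconsistent tagging that is hard to notice downstream.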
Important Clarification:
It’s important to emphasize that this topics index is not competing with our "Decoding Bitcoin" product. While "Decoding Bitcoin" started as an index of topics, it has since evolved into a broader resource. The topics index described here focuses solely on providing concise excerpts for each topic, not full definitions or in-depth content. Its purpose is to help us better define topic tags for efficient extraction and tagging across our infrastructure.
Implementation Plan:
- Centralized Topic Repository:
  We will create a new repository to house all topic tags and categories. This will replace the current process of fetching and merging topics within Bitcoin Transcripts. The repository will be accessible to multiple projects, ensuring consistency in tagging and extraction.
- Standardized Topic Information:
  - We’ll start by using topics from Optech, following the CategoryInfo schema used in the Bitcoin Transcripts Registry.
  - We’ll introduce additional topic tags that aren’t found in Optech, covering a wider range of topics from sources like Stack Exchange and Delving Bitcoin.
  - We’ll also align the "Topic Modeling generation" cron job with this new repository to ensure it uses an up-to-date list of topics. The current process of generating primary and secondary topics will be simplified to generate only one set of topics. Finally, we need to evaluate the quality of the existing tags generated by this process to ensure consistency and accuracy.
- Integration Across Products:
  - The new topics list will be consumed not only by Bitcoin Transcripts but by other parts of our ecosystem, such as Bitcoin Search and the scraper.
  - This integration will provide a unified experience for tagging and exploring content across all platforms.
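For the tag-quality evaluation mentioned under Standardized Topic Information, one possible starting point is to compare generated tags against a small hand-labeled reference set. The sketch below computes per-document precision and recall; this metric choice and the name `scoreTags` are suggestions for illustration, not something the plan above prescribes.

```typescript
type TagScore = { precision: number; recall: number };

// Compare generated tags for one document against hand-labeled reference tags.
// Precision: share of generated tags that are correct.
// Recall: share of reference tags that were generated.
// Both are reported as 0 when their denominator is empty.
function scoreTags(predicted: string[], reference: string[]): TagScore {
  const predictedSet = new Set(predicted);
  const referenceSet = new Set(reference);
  let hits = 0;
  predictedSet.forEach((slug) => {
    if (referenceSet.has(slug)) hits += 1;
  });
  return {
    precision: predictedSet.size === 0 ? 0 : hits / predictedSet.size,
    recall: referenceSet.size === 0 ? 0 : hits / referenceSet.size,
  };
}

// Placeholder slugs: two of three generated tags match, two of four reference tags are found.
const score = scoreTags(["topic-a", "topic-b", "topic-c"], ["topic-b", "topic-c", "topic-d", "topic-e"]);
console.log(score.recall); // prints 0.5
```

Averaging these scores over a labeled sample of documents from each source would give a first, rough read on whether the existing tags are usable.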
Future Vision:
As this repository evolves, it could eventually become a standalone page, similar to a glossary, on the main Bitcoin Dev Project website. This would provide an additional resource for users to explore key Bitcoin topics in a more structured and comprehensive way.