# Building a Dynamic Industry Classification System with Captide’s API

Traditional industry codes (GICS, NAICS) stick each firm in one box, but modern companies rarely fit neatly—think Amazon spanning retail, cloud, and media. Static labels also lag behind new fields like AI infrastructure or DeFi, hiding real risk links.

Dynamic Industry Classification (DIC) fixes this by clustering firms on live data—product lines, revenue segments, SEC-filing language—so a company can sit in several groups that shift as its business evolves. These data-driven clusters track financial behavior (e.g., return correlations) better than one-size codes, improving peer analysis, factor models, and portfolio design.

Captide’s API supplies the fuel: it uses LLMs to extract clean information from 10-Ks and 10-Qs, eliminating manual parsing. In the tutorial that follows, we’ll:
1. Fetch product/service text with Captide.
2. Turn it into embeddings.
3. Cluster similar companies.
4. Auto-label clusters with GPT.
5. Tabulate each firm’s multi-cluster exposure.

You’ll get runnable Python (requests, sentence_transformers, sklearn, openai, tqdm, pandas) and will need Captide and OpenAI API keys to run it. That is enough to build a DIC pipeline that reflects today’s market, not yesterday’s categories.

### Step 1: Pull Product-and-Service Text with Captide
The first step is to collect data that accurately reflects what each company actually does. For this, we focus on the official descriptions of products and services found in regulatory filings. Captide’s API streamlines this process. It allows users to submit natural language queries and receive precise, source-backed responses extracted directly from filings. This eliminates the need for manual PDF parsing or keyword scraping.

In the example below, we query Captide’s `/rag/agent-query-stream` endpoint. The query asks: “List all the products or services of the company in a dictionary where each key is the name of the product and the value is a brief description.” We apply this query to a list of companies, retrieving data from their most recent 10-K, 10-Q, or 8-K filings. While our simplified approach uses only the latest reports, a more advanced implementation could iterate over historical filings to track how business activities evolve over time.
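
As a rough illustration of the historical variant, the sketch below builds consecutive 90-day date windows; each pair could then be supplied as the start and end dates of a separate query. The helper name `date_windows` and the fixed reference date are assumptions made for illustration, not part of Captide’s API.

```python
import datetime as dt

def date_windows(end, n_windows, days=90):
    """Return n_windows consecutive (start, end) ISO date pairs, newest first."""
    windows = []
    for _ in range(n_windows):
        start = end - dt.timedelta(days=days)
        windows.append((start.isoformat(), end.isoformat()))
        end = start  # the next (older) window ends where this one starts
    return windows

# Hypothetical reference date; in practice this would be dt.date.today().
windows = date_windows(dt.date(2025, 6, 30), n_windows=4)
for window_start, window_end in windows:
    print(window_start, window_end)
```

Adjacent windows share a boundary date, so a filing dated exactly on a boundary could be returned twice and may need de-duplicating.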

In [13]:
import os
import re
import json
import datetime as dt
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

CAPTIDE_API_KEY = os.getenv("CAPTIDE_API_KEY")

TICKERS = ['AAPL', 'ABBV', 'ABT', 'ACN', 'ADBE', 'AIG', 'AMD', 'AMGN', 'AMT', 'AMZN', 'AVGO', 'AXP', 'BA', 'BAC', 'BK', 'BKNG', 'BLK', 'BMY', 'BRK.B', 'C', 'CAT', 'CHTR', 'CL', 'CMCSA', 'COF', 'COP', 'COST', 'CRM', 'CSCO', 'CVS', 'CVX', 'DE', 'DHR', 'DIS', 'DUK', 'EMR', 'FDX', 'GD', 'GE', 'GILD', 'GM', 'GOOG', 'GS', 'HD', 'HON', 'IBM', 'INTC', 'INTU', 'ISRG', 'JNJ', 'JPM', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'MA', 'MCD', 'MDLZ', 'MDT', 'MET', 'META', 'MMM', 'MO', 'MRK', 'MS', 'MSFT', 'NEE', 'NFLX', 'NKE', 'NOW', 'NVDA', 'ORCL', 'PEP', 'PFE', 'PG', 'PLTR', 'PM', 'PYPL', 'QCOM', 'RTX', 'SBUX', 'SCHW', 'SO', 'SPG', 'T', 'TGT', 'TMO', 'TMUS', 'TSLA', 'TXN', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VZ', 'WFC', 'WMT', 'XOM']

LOOKBACK_DAYS = 90
START_DATE = (dt.date.today() - dt.timedelta(days=LOOKBACK_DAYS)).isoformat()
END_DATE = dt.date.today().isoformat()

HEADERS = {
    "X-API-Key": CAPTIDE_API_KEY,
    "Content-Type": "application/json",
    "Accept": "application/json",
}

QUERY_TEMPLATE = (
    "List all the products or services of the company in a dictionary where each key is the name "
    "of the product and the value is a brief description. In the description don't include company "
    "or brand names, just a description of the products or services offered. Don't include any "
    "introductory text or outro in the response, just the dictionary."
)

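def _extract_dict(text: str):
    """Pull the product dictionary out of the raw streamed response text."""
    # Capture the JSON string that follows "content", allowing escaped quotes inside it.
    m = re.search(r'"type":"full_answer","content":"((?:[^"\\]|\\.)*)"', text, re.DOTALL)
    if not m:
        return None
    try:
        # Decode JSON string escapes; unlike .encode().decode("unicode_escape"),
        # this keeps non-ASCII characters intact.
        cleaned = json.loads(f'"{m.group(1)}"', strict=False)
        # Drop bracketed markers of the form [#id] (presumably source citations).
        cleaned = re.sub(r"\s*\[#\w+\]", "", cleaned)
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None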

def fetch_products_one(ticker: str):
    payload = {
        "query": QUERY_TEMPLATE,
        "tickers": [ticker],
        "sourceType": ["10-K", "10-Q", "8-K"],
        "startDate": START_DATE,
        "endDate": END_DATE,
    }
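    try:
        resp = requests.post(
            "https://rest-api.captide.co/api/v1/rag/agent-query-stream",
            json=payload,
            headers=HEADERS,
            timeout=300,
        )
        # Treat HTTP errors as failures instead of trying to parse an error body.
        resp.raise_for_status()
        return ticker, _extract_dict(resp.text)
    except Exception as err:
        # Report the failure and skip this ticker rather than aborting the batch.
        print(f"{ticker}: request failed ({err})")
        return ticker, None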

all_products = {}
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch_products_one, t) for t in TICKERS]
    for fut in as_completed(futures):
        tic, prod = fut.result()
        if prod:
            all_products[tic] = prod

print(json.dumps(all_products["AAPL"], indent=2, ensure_ascii=False))

{
  "iPhone": "Smartphones featuring advanced camera systems, high-performance processors, and a range of models including entry-level and professional options, designed for communication, productivity, and entertainment.",
  "Mac": "Personal computers available as laptops and desktops, equipped with proprietary silicon chips, offering high performance for professional and personal use, including models optimized for artificial intelligence workloads.",
  "iPad": "Tablets designed for portability and versatility, supporting productivity, creativity, and entertainment, with models featuring high-performance chips and compatibility with stylus accessories.",
  "Wearables, Home and Accessories": "A category including smartwatches with health and fitness tracking, wireless earbuds with advanced audio features, spatial computing headsets, and a range of accessories for personal and home use.",
  "Apple Watch Series 10": "A smartwatch offering health and fitness tracking, notifications, and 

### Step 2: Generate Embeddings Using a Sentence-Transformers Model

With a structured set of product and service descriptions in hand, the next step is to convert these texts into numerical embeddings—dense vector representations that capture the semantic meaning of each description. Embeddings are essential for comparing textual content: descriptions that refer to similar business activities (e.g., cloud infrastructure, digital payments, or logistics) will be positioned close to each other in the embedding space. This enables us to identify clusters of companies based on shared operational themes, even if the language they use differs.

To generate these embeddings, we’ll use a pre-trained model from the Sentence Transformers library. For this example, we’ll use the `all-MiniLM-L6-v2` model, which offers a strong balance between speed and accuracy for general-purpose sentence embedding tasks.

In [14]:
from sentence_transformers import SentenceTransformer

EMBED_MODEL = "all-MiniLM-L6-v2"

texts, meta = [], []

for ticker, prod_dict in all_products.items():
    for name, desc in prod_dict.items():
        texts.append(desc)
        meta.append({"ticker": ticker, "product": name})

embedder = SentenceTransformer(EMBED_MODEL)
embeddings = embedder.encode(texts, batch_size=256, show_progress_bar=True)

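
Closeness in embedding space is commonly measured with cosine similarity. The sketch below computes it on made-up three-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions); the vector values and variable names are illustrative assumptions, not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three descriptions.
cloud_a = [0.9, 0.1, 0.0]
cloud_b = [0.8, 0.2, 0.1]
snacks = [0.0, 0.2, 0.9]

sim_cloud = cosine_similarity(cloud_a, cloud_b)  # two similar descriptions
sim_mixed = cosine_similarity(cloud_a, snacks)   # two unrelated descriptions
print(round(sim_cloud, 3), round(sim_mixed, 3))
```

The two "cloud" vectors score close to 1 while the unrelated pair scores close to 0, which is the property the clustering step relies on.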


### Step 3: Cluster Embeddings and Assign Descriptive Labels Using GPT

With our product and service embeddings prepared, the next step is to identify patterns by grouping similar vectors together. This unsupervised clustering process allows us to uncover latent industry groupings that emerge directly from the data—free from any predefined categories.

We’ll use the K-means algorithm to partition the embedding space into `k` clusters. Each cluster represents a group of descriptions with similar semantic content, ideally corresponding to a coherent business domain or functional area. For this demonstration, we’ll set `k=30`, though this number can be adjusted based on dataset size, diversity, or downstream analytical needs.
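
One common way to choose `k`, offered here as a sketch and not the method used in this tutorial, is to compare silhouette scores across candidate values. The example runs on synthetic 2-D blobs as a stand-in for the embedding matrix; `pick_k` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embeddings: four well-separated groups of points.
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

def pick_k(X, candidates):
    """Fit K-means for each candidate k and return (best_k, {k: silhouette})."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

best_k, scores = pick_k(X, range(2, 9))
print(best_k)
```

On the real data, `embeddings` would take the place of `X`, with a candidate range chosen around the expected number of business themes.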

After clustering, we’re left with numbered clusters (e.g., Cluster 0 to Cluster 29). These numeric labels are arbitrary and lack interpretability. To make the output more useful, we’ll generate descriptive names for each cluster using OpenAI’s GPT model.

For each cluster, we sample a few representative descriptions and prompt GPT to synthesize a concise, human-readable label that captures the common theme. This step leverages the model’s strength in summarization and abstraction, allowing us to convert raw groupings into meaningful industry labels.

In [15]:
from sklearn.cluster import KMeans
from collections import defaultdict
import json as _json
from openai import OpenAI
from tqdm import tqdm as _tqdm
import pprint

N_CLUSTERS = 30
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

km = KMeans(n_clusters=N_CLUSTERS, n_init="auto", random_state=42)
labels = km.fit_predict(embeddings)

cluster_to_desc = defaultdict(list)
for idx, cid in enumerate(labels):
    cluster_to_desc[cid].append({
        "ticker": meta[idx]["ticker"],
        "product": meta[idx]["product"],
        "description": texts[idx],
    })

cluster_names = {}
for cid, desc_list in _tqdm(cluster_to_desc.items(), desc="Clusters"):
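    # Pass only the description text so tickers and product names don't leak into the label.
    bullets = "\n".join(f"- {d['description']}" for d in desc_list[:10])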
    system = "You are a market-structure analyst. Name the common theme."
    prompt = ("Return ONLY valid JSON: {\"label\": string, \"confidence\": int (0-100)}\n\n" + bullets)
    try:
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.2,
            max_tokens=50,
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
        )
        label = _json.loads(resp.choices[0].message.content.strip())["label"].strip()
    except Exception:
        label = "Miscellaneous"
    cluster_names[cid] = label

pprint.pprint(cluster_names)


{np.int32(0): 'Energy and Power Generation',
 np.int32(1): 'Payment and Transaction Services',
 np.int32(2): 'Consumer Technology and Digital Services',
 np.int32(3): 'Food and Beverage Industry',
 np.int32(4): 'Lending and Financing Services',
 np.int32(5): 'Cancer Treatment Therapies',
 np.int32(6): 'Industrial and Commercial Equipment and Solutions',
 np.int32(7): 'Snack Foods',
 np.int32(8): 'Industrial and Defense Technology Solutions',
 np.int32(9): 'Pharmaceutical Products',
 np.int32(10): 'Digital Transformation and Customer Experience Optimization',
 np.int32(11): 'Beverage Products by Coca-Cola',
 np.int32(12): 'Integration of AI and Cloud Technologies',
 np.int32(13): 'Advanced Computing and Processing Technologies',
 np.int32(14): 'Energy and Fuel Solutions',
 np.int32(15): 'Construction and Heavy Equipment',
 np.int32(16): 'Biopharmaceutical Treatments',
 np.int32(17): 'Logistics and Transportation Services',
 np.int32(18): 'Pharmaceuticals targeting metabolic and cardiova




### Step 4: Summarize Cluster Distribution at the Company Level

In the final step, we shift from product-level analysis to a company-level view. The goal is to understand how each company is represented across the dynamically generated industry clusters.

To summarize the distribution, we construct a `pandas` DataFrame listing each company along with the cluster assignments of its product and service descriptions. We then aggregate the results to show how many entries from each company fall into each cluster.

While our example treats each description equally, a more advanced version could weight entries by financial relevance—such as segment revenue or operating income—for a more economically meaningful representation.

In [16]:
import pandas as pd
from collections import Counter

totals = Counter(m["ticker"] for m in meta)
per_company_cluster = defaultdict(Counter)
for i, cid in enumerate(labels):
    ticker = meta[i]["ticker"]
    per_company_cluster[ticker][cid] += 1

rows = []
for tic in sorted(per_company_cluster):
    for cid, cnt in per_company_cluster[tic].items():
        pct = round(100 * cnt / totals[tic], 2)
        rows.append(
            {
                "Ticker": tic,
                "ClusterID": cid,
                "ClusterName": cluster_names[cid],
                "Products": cnt,
                "% of Ticker Products": pct,
            }
        )

summary_df = pd.DataFrame(rows).sort_values(["Ticker", "% of Ticker Products"], ascending=[True, False])
display(summary_df)

| | Ticker | ClusterID | ClusterName | Products | % of Ticker Products |
|---|---|---|---|---|---|
| 2 | AAPL | 2 | Consumer Technology and Digital Services | 11 | 57.89 |
| 3 | AAPL | 21 | Healthcare and Insurance Solutions | 2 | 10.53 |
| 6 | AAPL | 12 | Integration of AI and Cloud Technologies | 2 | 10.53 |
| 0 | AAPL | 26 | Telecommunications and Connectivity Services | 1 | 5.26 |
| 1 | AAPL | 13 | Advanced Computing and Processing Technologies | 1 | 5.26 |
| ... | ... | ... | ... | ... | ... |
| 369 | WMT | 29 | Diversified Revenue Streams | 1 | 14.29 |
| 370 | WMT | 4 | Lending and Financing Services | 1 | 14.29 |
| 371 | XOM | 14 | Energy and Fuel Solutions | 14 | 66.67 |
| 373 | XOM | 24 | Personal and Beauty Care Products | 4 | 19.05 |
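
The revenue-weighted variant mentioned earlier could look like the sketch below, which replaces product counts with a weight per entry. The tickers, cluster names, and revenue figures are invented for illustration, and `weighted_exposure` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical inputs: (ticker, cluster name, segment revenue in $bn).
entries = [
    ("ACME", "Cloud Infrastructure", 60.0),
    ("ACME", "Online Retail", 30.0),
    ("ACME", "Cloud Infrastructure", 10.0),
    ("BOLT", "Logistics", 25.0),
]

def weighted_exposure(entries):
    """Return {ticker: {cluster: percent of the ticker's total weight}}."""
    totals = defaultdict(float)
    by_cluster = defaultdict(lambda: defaultdict(float))
    for ticker, cluster, weight in entries:
        totals[ticker] += weight
        by_cluster[ticker][cluster] += weight
    return {
        ticker: {c: round(100 * w / totals[ticker], 2) for c, w in clusters.items()}
        for ticker, clusters in by_cluster.items()
    }

exposure = weighted_exposure(entries)
print(exposure)
```

The design mirrors the count-based table: only the increment changes, from 1 per description to the weight attached to that description.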


### Conclusion

In this post, we introduced the concept of Dynamic Industry Classification (DIC) and demonstrated how to build a simple yet powerful DIC system using Captide’s API, combined with modern natural language processing tools.

This approach shows how language models can add resolution to financial analysis. With relatively little code, we replaced rigid, single-label industry codes with a data-driven view in which each company is spread across several clusters according to what it actually sells.

Captide’s API played a central role—providing clean, on-demand access to the information buried in filings, which served as the foundation of our classification. In a real-world scenario, this framework could be expanded to cover hundreds or thousands of companies and updated continuously as new filings arrive—offering analysts a living, evolving map of the business landscape.

### Potential Improvements and Next Steps

Our implementation is a basic prototype of a dynamic industry classification. There are many ways to improve and extend this approach:

1. **Improve Accuracy with Better Models:** We used a general-purpose MiniLM model, but accuracy could improve with domain-specific or more powerful models. Advanced clustering methods (e.g., DBSCAN, spectral) might also capture more natural groupings than K-means.

2. **Hierarchical Clustering:** Industry data is hierarchical (sectors → industries → sub-industries). Hierarchical clustering could reflect this structure, creating a dynamic taxonomy instead of a flat set of clusters.

3. **Fit into the Current GICS Classification:** An alternative approach would be to index the descriptions of current GICS industries and sub-industries and assign each company’s products to the closest category. Outliers could be grouped into new clusters.

4. **Soft Clustering & Multi-label Classification:** Instead of assigning each description to one cluster, soft clustering or topic modeling can reflect overlaps—e.g., a product belonging partly to multiple categories. This better mirrors reality and supports embedding-based probability distributions over sectors.

5. **More Data Sources:** Beyond 10-Ks, data like earnings calls, investor presentations, and proxy statements can add valuable context. Captide ingests these too, helping track trends like rising mentions of “AI initiatives.”

6. **Automation & Scaling:** A production system would automate data updates and re-clustering, and would monitor cluster stability. Evaluation, both qualitative and quantitative, could ensure relevance and detect shifts (e.g., from acquisitions or strategy changes).
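
To make the soft-clustering idea from the list above concrete, here is a minimal sketch that turns distances to cluster centroids into membership probabilities with a softmax. It is one possible scheme among several, not the tutorial’s method; `soft_assign`, the `temperature` parameter, and the toy centroids are assumptions. With the K-means model from Step 3, the centroids would come from `km.cluster_centers_`.

```python
import math

def soft_assign(vector, centroids, temperature=1.0):
    """Softmax over negative Euclidean distances to each centroid."""
    dists = [math.dist(vector, c) for c in centroids]
    # Subtract the smallest distance before exponentiating for numerical stability.
    d_min = min(dists)
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

# Toy 2-D centroids; the query point lies midway between the first two.
centroids = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
probs = soft_assign([2.0, 0.0], centroids)
print([round(p, 3) for p in probs])
```

A lower `temperature` pushes the probabilities toward a hard assignment, while a higher one spreads membership more evenly across clusters.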