# Measuring Intangible Investment from 10-K Filings

This notebook walks through the full n-gram pipeline applied to SEC 10-K filings.

**Research question:** What fraction of a firm's SG&A spending goes toward intangible investment (R&D, brand capital, organizational capital) versus routine operating expenses?

**Approach:** Extract noun phrases from 10-K filings, cluster them into semantic communities, hand-label those communities as types of intangible investment, then score each filing against the labeled taxonomy.

## Setup

Make sure you've installed dependencies:
```bash
pip install -r ../../requirements.txt
```

In [None]:
import os
# Move to the ngram_pipeline root; the check keeps the cell safe to re-run
if not os.path.exists('01_extract_text.py'):
    os.chdir('../..')
print('Working directory:', os.getcwd())

## Stage 1: Text Extraction

This stage reads raw 10-K files, strips HTML tags, and extracts the Item 7 (MD&A) section using configurable regex patterns.
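
As a rough illustration of what this stage does, the sketch below strips tags with a regex and keeps the longest span between an Item 7 heading and the next Item 7A or Item 8 heading. The patterns and the function names `strip_html` and `extract_item7` are assumptions made for this example; the actual patterns come from `config_intangible.yaml` and may differ.

```python
import html
import re

# Hypothetical patterns for illustration; the pipeline reads its own from the config.
ITEM7_START = re.compile(r"item\s+7\.?\s+management[\u2019']s\s+discussion", re.IGNORECASE)
ITEM7_END = re.compile(r"item\s+7a\.?|item\s+8\.?", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def strip_html(raw):
    """Drop tags, decode HTML entities, and collapse whitespace."""
    text = TAG.sub(" ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def extract_item7(text):
    """Return the longest span from an Item 7 heading to the next Item 7A/8 heading."""
    best = ""
    for start in ITEM7_START.finditer(text):
        end = ITEM7_END.search(text, start.end())
        stop = end.start() if end else len(text)
        candidate = text[start.start():stop].strip()
        if len(candidate) > len(best):
            best = candidate
    return best


sample = (
    "<html><body><p>Item 7. Management&#8217;s Discussion and Analysis</p>"
    "<p>SG&amp;A rose on advertising.</p>"
    "<p>Item 8. Financial Statements</p></body></html>"
)
section = extract_item7(strip_html(sample))
print(section)
```

Picking the longest candidate is one simple way to skip table-of-contents entries that also mention Item 7.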

In [None]:
!python 01_extract_text.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
# Check what was extracted
import glob
texts = glob.glob('output/intangible_investment/texts/*.txt')
print(f'{len(texts)} text files extracted')
if texts:
    with open(texts[0], encoding='utf-8', errors='replace') as f:
        content = f.read()
    print(f'\nFirst file: {os.path.basename(texts[0])}')
    print(f'Length: {len(content.split())} words')
    print(f'Preview: {content[:500]}...')

## Stage 1b: LLM-Based Text Extraction

The paper uses an LLM to extract SG&A-focused text from each document's Item 7 section before building the n-gram dictionary. For each document, the LLM extracts three types of quotes:
- **Definition quotes** — what costs make up the expense
- **Business driver quotes** — why the company incurs these costs
- **Change driver quotes** — what caused the expense to increase/decrease

This focuses the dictionary on expense-relevant language rather than all text in the filing.
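
To make the three quote types concrete, here is a minimal sketch of how a prompt might be assembled and how a JSON response might be flattened into one quote per line. The prompt wording, the JSON keys, and the helpers `build_prompt` and `parse_quotes` are all hypothetical; `01b_llm_extract.py` may structure its request and output differently, and no LLM call is made here.

```python
import json

# Quote types from the list above; the key names are assumptions for this sketch.
QUOTE_TYPES = {
    "definition": "what costs make up the expense",
    "business_driver": "why the company incurs these costs",
    "change_driver": "what caused the expense to increase or decrease",
}


def build_prompt(item7_text):
    """Assemble an extraction prompt; the wording is illustrative only."""
    wanted = "\n".join(f"- {name}: {desc}" for name, desc in QUOTE_TYPES.items())
    return (
        "From the MD&A text below, copy verbatim quotes about SG&A expenses.\n"
        f"Return JSON with one list of strings per key:\n{wanted}\n\n"
        f"TEXT:\n{item7_text}"
    )


def parse_quotes(response):
    """Flatten a JSON response into a list of quotes, ignoring unknown keys."""
    data = json.loads(response)
    quotes = []
    for name in QUOTE_TYPES:
        quotes.extend(q.strip() for q in data.get(name, []) if q.strip())
    return quotes


# A hand-written response standing in for real LLM output.
fake_response = json.dumps({
    "definition": ["SG&A includes advertising."],
    "change_driver": ["SG&A rose due to headcount. "],
    "other": ["ignored"],
})
print("\n".join(parse_quotes(fake_response)))
```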

**Requirements:** Either [Ollama](https://ollama.ai) running locally (free) or an OpenAI API key. Configure in `config_intangible.yaml` under `llm_extract.provider`.

In [None]:
!python 01b_llm_extract.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
# Check LLM extracts
extracts = glob.glob('output/intangible_investment/llm_extracts/*.txt')
print(f'{len(extracts)} LLM extract files')
if extracts:
    with open(extracts[0], encoding='utf-8', errors='replace') as f:
        content = f.read()
    print(f'\nFirst file: {os.path.basename(extracts[0])}')
    print(f'Quotes extracted: {len(content.splitlines())}')
    print(f'Preview:\n{content[:500]}...')

## Stage 2: N-Gram Clustering

This stage extracts POS-filtered bigrams and trigrams from the LLM-extracted text (if Stage 1b was run) or from the full documents, embeds them, and clusters them into semantic communities.

**Note:** With a small sample corpus, you'll get fewer communities than the full analysis (which produced ~248 from 10,000 n-grams across thousands of filings).
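
The n-gram extraction part of this stage can be sketched with the standard library alone. The example n-grams in this notebook, such as `research development work`, suggest that function words are dropped before n-grams are formed, so the sketch assumes that. A small stop-word set stands in for real POS filtering, and the embedding and clustering steps are left out; `STOP_WORDS` and `count_ngrams` are names invented for this example.

```python
from collections import Counter
import re

# Stand-in for POS filtering: a real run would use a tagger to keep noun phrases.
STOP_WORDS = {"the", "of", "and", "to", "in", "for", "our", "a", "an", "by", "was", "were"}


def ngrams(tokens, n):
    """Yield contiguous n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(text, sizes=(2, 3)):
    """Count bigrams and trigrams over the tokens left after stop-word removal."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    counts = Counter()
    for n in sizes:
        for gram in ngrams(tokens, n):
            counts[" ".join(gram)] += 1
    return counts


counts = count_ngrams(
    "Research and development work increased. Research and development costs rose."
)
print(counts.most_common(3))
```

Note that this toy version lets n-grams cross sentence boundaries; splitting on sentences first would avoid that.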

In [None]:
!python 02_cluster_ngrams.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
import pandas as pd

# Look at top n-grams
master = pd.read_csv('output/intangible_investment/clusters/ngram_master_list.csv')
print(f'Top n-grams: {len(master)}')
master.head(20)

In [None]:
# Look at discovered communities
labels = pd.read_csv('output/intangible_investment/clusters/community_results/community_labels_k500.csv')
print(f'Communities discovered: {len(labels)}')
labels.head(10)

## Manual Labeling Step

**This is the key human-in-the-loop step.** After Stage 2, you need to open the `community_labels_k500.csv` output and label each community. The `representatives` column shows the most central n-grams — use these to decide what category each community represents.

Add two columns to the CSV:
- **`category`**: e.g., `Intangible investment`, `Not intangible investment`, `unknown`
- **`subcategory`** (optional): e.g., `knowledge capital`, `brand or customer capital`, `organization capital`

For example:
- `advertising promotion expense, advertising expense cost` → **Intangible investment / brand or customer capital**
- `function research development, research development work` → **Intangible investment / knowledge capital**
- `cost office rent, rent expense office` → **Not intangible investment**

After labeling, save the file and set `doc_ngrams.community_labels_csv` in `config_intangible.yaml` to point to it.
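
Before pointing the config at the labeled file, a quick sanity check can catch rows that were skipped or mistyped. The sketch below uses only the standard library and an in-memory stand-in for the labeled CSV; the allowed category values are taken from the examples above, and `find_label_problems` is a hypothetical helper, not part of the pipeline.

```python
import csv
import io

# Category values from the examples above; extend the set if you add your own.
ALLOWED = {"Intangible investment", "Not intangible investment", "unknown"}


def find_label_problems(rows):
    """Return (row_number, message) pairs for rows with a missing or unexpected category."""
    problems = []
    for i, row in enumerate(rows, start=1):
        category = (row.get("category") or "").strip()
        if category not in ALLOWED:
            problems.append((i, f"unexpected category {category!r}"))
    return problems


# Toy stand-in for a labeled community_labels_k500.csv.
sample = io.StringIO(
    "representatives,category,subcategory\n"
    '"advertising promotion expense, advertising expense cost",'
    "Intangible investment,brand or customer capital\n"
    '"cost office rent, rent expense office",Not intangible investment,\n'
    '"function research development, research development work",,knowledge capital\n'
)
problems = find_label_problems(csv.DictReader(sample))
print(problems)
```

Here the third row is flagged because its `category` cell was left blank.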

A reference file showing what completed labeling looks like (from the full-corpus analysis with ~231 communities) is at `examples/intangible_investment/labeled_communities_reference.csv`.

**Note:** The communities you get from this small sample corpus will differ from the reference — that's expected. The reference is just to show the labeling format.

In [None]:
# Look at the reference labeled communities (from the full-corpus analysis)
labeled = pd.read_csv('examples/intangible_investment/labeled_communities_reference.csv')
print(f'Reference labeled communities: {len(labeled)}')
print('\nCategory distribution:')
print(labeled['category'].value_counts())
print('\nSubcategory distribution (within Intangible investment):')
print(labeled.loc[labeled['category'] == 'Intangible investment', 'subcategory'].value_counts())
print('\nThis is what YOUR labeled CSV should look like after the manual step.')

## Stage 3: Per-Document N-Gram Extraction

This stage re-extracts n-grams from each document's **full Item 7 text** (not the LLM extracts) and builds the master mapping from the clustering output and the labeled communities. The dictionary is therefore built from the focused LLM extracts, while scoring runs over the complete Item 7 text.
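
Conceptually, the master mapping is a join between the clustering output (n-gram to community) and the hand labels (community to category). The sketch below shows that join on plain dictionaries. The community ids, the dictionary layout, and the choice to skip n-grams from unlabeled communities are assumptions for illustration; the columns actually written to `master_ngram_mapping.csv` may differ.

```python
def build_master_mapping(ngram_to_community, community_labels):
    """Attach each n-gram's community labels; skip n-grams in unlabeled communities."""
    mapping = {}
    for ngram, community in ngram_to_community.items():
        labels = community_labels.get(community)
        if labels is None:
            continue
        mapping[ngram] = {"community": community, **labels}
    return mapping


# Toy inputs with made-up community ids.
ngram_to_community = {
    "advertising expense": 12,
    "research development work": 40,
    "cost office rent": 7,
    "fiscal year": 99,  # community 99 was never labeled
}
community_labels = {
    12: {"category": "Intangible investment", "subcategory": "brand or customer capital"},
    40: {"category": "Intangible investment", "subcategory": "knowledge capital"},
    7: {"category": "Not intangible investment", "subcategory": ""},
}
mapping = build_master_mapping(ngram_to_community, community_labels)
for ngram, row in mapping.items():
    print(ngram, "->", row["category"], "/", row["subcategory"])
```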

In [None]:
!python 03_extract_doc_ngrams.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
# Check the master mapping
mapping = pd.read_csv('output/intangible_investment/master_ngram_mapping.csv')
print(f'Master mapping: {len(mapping)} n-grams mapped to categories')
print('\nCategory breakdown:')
print(mapping['category'].value_counts())

## Stage 4: Document Scoring

This stage scores each document against the labeled communities using cosine-weighted n-gram counts, producing a probability distribution over categories for each document.
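
One plausible reading of "cosine-weighted n-gram counts" is sketched below: each matched n-gram contributes its count multiplied by the cosine similarity between its embedding and its community's centroid, and the per-category totals are normalized to sum to 1. This is an assumption about the weighting, not a description of `04_score_documents.py`; the vectors are toy values and `score_document` is a hypothetical name.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_document(ngram_counts, ngram_info):
    """Turn n-gram counts into a probability distribution over categories.

    ngram_info maps n-gram -> (category, ngram_vector, community_centroid).
    """
    mass = defaultdict(float)
    for ngram, count in ngram_counts.items():
        if ngram not in ngram_info:
            continue  # n-grams outside the dictionary do not contribute
        category, vec, centroid = ngram_info[ngram]
        weight = max(cosine(vec, centroid), 0.0)  # clip negative similarities
        mass[category] += count * weight
    total = sum(mass.values())
    return {cat: m / total for cat, m in mass.items()} if total else {}


# Toy dictionary with two-dimensional "embeddings".
ngram_info = {
    "advertising expense": ("Intangible investment", (1.0, 0.0), (1.0, 0.0)),
    "office rent": ("Not intangible investment", (0.0, 2.0), (0.0, 1.0)),
}
doc_counts = Counter({"advertising expense": 3, "office rent": 1, "unmapped phrase": 5})
probs = score_document(doc_counts, ngram_info)
print(probs)
```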

In [None]:
!python 04_score_documents.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
# View final scores
scores = pd.read_csv('output/intangible_investment/scores/scores_category_prob_embedding.csv')
print(f'Documents scored: {len(scores)}')
scores.head(10)

In [None]:
# Subcategory breakdown (within intangible investment)
sub_scores = pd.read_csv('output/intangible_investment/scores/scores_subcategory_prob_embedding.csv')
sub_scores.head(10)

In [None]:
# Visualize score distributions
import matplotlib.pyplot as plt

if not scores.empty:
    # Average the numeric score columns, leaving out the document identifier
    score_cols = [c for c in scores.columns if c != 'doc_id']
    means = scores[score_cols].mean(numeric_only=True)
    
    fig, ax = plt.subplots(figsize=(10, 5))
    means.plot(kind='bar', ax=ax)
    ax.set_ylabel('Average probability')
    ax.set_title('Average category scores across documents')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

## Adapting to Your Own Corpus

To apply this pipeline to a different corpus or topic:

1. **Replace the input documents** — put your text files in a directory and point `extract.input_dir` to it
2. **Adjust the config** — disable section extraction if your documents don't have sections; update `custom_stop_words` for your domain
3. **Run Stages 1-2** — this will discover communities specific to your corpus
4. **Label the communities** — open the CSV, review representatives, add your own `category` and `subcategory` labels
5. **Run Stages 3-4** — score your documents against your taxonomy

The pipeline is domain-agnostic. The same algorithms work for news articles, scientific papers, legal documents, or any other text corpus.