# Measuring Intangible Investment from 10-K Filings

This notebook walks through the full n-gram pipeline applied to SEC 10-K filings.

**Research question:** What fraction of a firm's SG&A spending goes toward intangible investment (R&D, brand capital, organizational capital) versus routine operating expenses?

**Approach:** Extract noun phrases from 10-K filings, cluster them into semantic communities, hand-label those communities as types of intangible investment, then score each filing against the labeled taxonomy.

## Setup

Make sure you've installed dependencies:
```bash
pip install -r ../../requirements.txt
```

In [None]:
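import os

# Move to the ngram_pipeline root. The check keeps a re-run of this cell
# from climbing two more directories.
if not os.path.exists('01_extract_text.py'):
    os.chdir('../..')
print('Working directory:', os.getcwd())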

## Stage 1: Text Extraction

This stage reads raw 10-K files, strips HTML tags, and extracts the Item 7 (MD&A) section using configurable regex patterns.
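
As a rough illustration of what this stage does, the sketch below strips tags and pulls out an Item 7 span with regular expressions. It is a minimal sketch, not the code in `01_extract_text.py`: the patterns `ITEM7_START` and `ITEM7_END`, the helper names, and the "keep the longest candidate" heuristic (used to skip table-of-contents entries) are all assumptions made for this example.

```python
import html
import re

# Hypothetical patterns for illustration only; the pipeline reads its own from the config.
ITEM7_START = re.compile(r"item\s+7\.?\s+management", re.IGNORECASE)
ITEM7_END = re.compile(r"item\s+(?:7a|8)\b", re.IGNORECASE)


def strip_html(raw):
    """Drop tags, decode entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def extract_item7(text):
    """Return the longest span from an Item 7 heading to the next Item 7A/8 heading."""
    best = ""
    for start in ITEM7_START.finditer(text):
        end = ITEM7_END.search(text, start.end())
        stop = end.start() if end else len(text)
        candidate = text[start.start():stop].strip()
        if len(candidate) > len(best):
            best = candidate
    return best


# Toy filing: a table-of-contents entry followed by the real section.
raw = (
    "<p>Item 7. Management&#39;s Discussion 12</p>"
    "<p>Item 8. Financial Statements 30</p>"
    "<h2>Item 7. Management&#39;s Discussion and Analysis</h2>"
    "<p>We increased advertising spending.</p>"
    "<h2>Item 7A. Market Risk</h2>"
)
section = extract_item7(strip_html(raw))
print(section)
```

Real filings vary widely in how they format headings, which is presumably why the pipeline keeps its patterns configurable.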

In [None]:
!python 01_extract_text.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
# Check what was extracted
import glob
texts = sorted(glob.glob('output/intangible_investment/texts/*.txt'))  # sorted so "first file" is stable
print(f'{len(texts)} text files extracted')
if texts:
    # Read as UTF-8 and replace undecodable bytes rather than failing on one bad file
    with open(texts[0], encoding='utf-8', errors='replace') as f:
        content = f.read()
    print(f'\nFirst file: {os.path.basename(texts[0])}')
    print(f'Length: {len(content.split())} words')
    print(f'Preview: {content[:500]}...')

## Stage 2: N-Gram Clustering

This stage extracts bigrams and trigrams filtered by part of speech (POS), embeds them, and clusters the embeddings into semantic communities.

**Note:** With a small sample corpus, you'll get fewer communities than the full analysis (which produced ~248 from 10,000 n-grams across thousands of filings).
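
To make the three steps concrete, here is a toy version in plain Python. Everything in it is a stand-in chosen so the example runs without extra dependencies: `TOY_POS` is a hypothetical hand-written lexicon in place of a real POS tagger, `embed` uses bag-of-words counts in place of a learned embedding model, and `communities` uses connected components of a cosine-similarity graph in place of whatever clustering method `02_cluster_ngrams.py` applies.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical toy lexicon; a real run would call a POS tagger instead.
TOY_POS = {
    "advertising": "NOUN", "expense": "NOUN", "research": "NOUN",
    "development": "NOUN", "cost": "NOUN", "increased": "VERB", "and": "CONJ",
}
KEEP_TAGS = {"NOUN", "ADJ"}


def pos_filtered_ngrams(tokens, sizes=(2, 3)):
    """Yield bigrams/trigrams whose tokens are all nouns or adjectives."""
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if all(TOY_POS.get(tok) in KEEP_TAGS for tok in gram):
                yield " ".join(gram)


def embed(ngram):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(ngram.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def communities(ngrams, threshold=0.4):
    """Group n-grams into connected components of the similarity graph (union-find)."""
    parent = {g: g for g in ngrams}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in combinations(ngrams, 2):
        if cosine(embed(a), embed(b)) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for g in ngrams:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


tokens = "advertising expense increased and research development cost increased".split()
ngrams = list(dict.fromkeys(pos_filtered_ngrams(tokens)))  # de-duplicate, keep order
groups = communities(ngrams)
for members in groups:
    print(members)
```

On this toy sentence the advertising phrase ends up alone and the three research and development phrases end up together, which is the kind of grouping the labeling step later relies on.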

In [None]:
!python 02_cluster_ngrams.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
import pandas as pd

# Look at top n-grams
master = pd.read_csv('output/intangible_investment/clusters/ngram_master_list.csv')
print(f'Top n-grams: {len(master)}')
master.head(20)

In [None]:
# Look at discovered communities
labels = pd.read_csv('output/intangible_investment/clusters/community_results/community_labels_k500.csv')
print(f'Communities discovered: {len(labels)}')
labels.head(10)

## Manual Labeling Step

In the full analysis, we reviewed 231 communities and labeled each one. The `representatives` column shows the most central n-grams in each community.

For example:
- Community with representatives `advertising promotion expense, advertising expense cost` → **Intangible investment / brand or customer capital**
- Community with representatives `function research development, research development work` → **Intangible investment / knowledge capital**
- Community with representatives `cost office rent, rent expense office` → **Not intangible investment**

The pre-labeled file is at `examples/intangible_investment/labeled_communities.csv`.

In [None]:
labeled = pd.read_csv('examples/intangible_investment/labeled_communities.csv')
print(f'Hand-labeled communities: {len(labeled)}')
print('\nCategory distribution:')
print(labeled['category'].value_counts())
print('\nSubcategory distribution (within Intangible investment):')
intangible = labeled[labeled['category'] == 'Intangible investment']
print(intangible['subcategory'].value_counts())

## Stage 3: Per-Document N-Gram Extraction

This stage re-extracts n-grams from each document and builds the master mapping by combining the clustering output with the labeled communities.
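
Conceptually, the master mapping is a join between "which community is each n-gram in" and "what label did that community get". The sketch below shows the idea in plain Python on made-up data. The `ngram` and `community_id` field names, the community ids, and the choice to drop n-grams from unlabeled communities are assumptions for illustration; only `category` and `subcategory` come from the labeled file used in this notebook.

```python
# Hypothetical clustering output: each n-gram with the community it was assigned to.
ngram_to_community = {
    "advertising expense cost": 12,
    "research development work": 40,
    "rent expense office": 77,
    "fiscal year end": 93,  # belongs to a community nobody labeled
}

# Hypothetical hand labels keyed by community id.
community_labels = {
    12: {"category": "Intangible investment", "subcategory": "brand or customer capital"},
    40: {"category": "Intangible investment", "subcategory": "knowledge capital"},
    77: {"category": "Not intangible investment", "subcategory": ""},
}


def build_master_mapping(ngram_to_community, community_labels):
    """Inner-join n-grams to labels; n-grams in unlabeled communities are dropped."""
    rows = []
    for ngram, community_id in sorted(ngram_to_community.items()):
        label = community_labels.get(community_id)
        if label is None:
            continue  # assumption: unlabeled communities do not enter the mapping
        rows.append({"ngram": ngram, "community_id": community_id, **label})
    return rows


mapping_rows = build_master_mapping(ngram_to_community, community_labels)
for row in mapping_rows:
    print(row)
```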

In [None]:
!python 03_extract_doc_ngrams.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
# Check the master mapping
mapping = pd.read_csv('output/intangible_investment/master_ngram_mapping.csv')
print(f'Master mapping: {len(mapping)} n-grams mapped to categories')
print('\nCategory breakdown:')
print(mapping['category'].value_counts())

## Stage 4: Document Scoring

This stage scores each document against the labeled communities using cosine-weighted n-gram counts, producing a probability distribution over categories for each document.
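
One plausible reading of "cosine-weighted n-gram counts" is sketched below: each matched n-gram contributes its count times a cosine weight to its category, and the category totals are normalized to sum to 1. This is an assumption about the scoring rule, not a transcription of `04_score_documents.py`. The function name, the `(category, weight)` layout, and the weights themselves are made up for the example.

```python
from collections import Counter, defaultdict


def score_document(doc_ngrams, mapping):
    """Sum weight * count per category, then normalize so the scores sum to 1."""
    counts = Counter(doc_ngrams)
    totals = defaultdict(float)
    for ngram, count in counts.items():
        entry = mapping.get(ngram)
        if entry is None:
            continue  # n-gram is not in the labeled taxonomy
        category, weight = entry
        totals[category] += weight * count
    norm = sum(totals.values())
    return {cat: value / norm for cat, value in totals.items()} if norm else {}


# Hypothetical mapping: n-gram -> (category, cosine similarity to its community).
toy_mapping = {
    "advertising expense": ("Intangible investment", 0.9),
    "research development": ("Intangible investment", 0.8),
    "rent expense": ("Not intangible investment", 0.5),
}
doc = [
    "advertising expense", "advertising expense",
    "research development", "rent expense", "fiscal year",
]
probs = score_document(doc, toy_mapping)
print(probs)
```

Here the intangible category receives 0.9 × 2 + 0.8 = 2.6 units of weight against 0.5 for the other category, so its share is 2.6 / 3.1.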

In [None]:
!python 04_score_documents.py --config examples/intangible_investment/config_intangible.yaml

In [None]:
# View final scores
scores = pd.read_csv('output/intangible_investment/scores/scores_category_prob_embedding.csv')
print(f'Documents scored: {len(scores)}')
scores.head(10)

In [None]:
# Subcategory breakdown (within intangible investment)
sub_scores = pd.read_csv('output/intangible_investment/scores/scores_subcategory_prob_embedding.csv')
sub_scores.head(10)

In [None]:
# Visualize score distributions
import matplotlib.pyplot as plt

if not scores.empty:
    # Average only the numeric score columns; doc_id is an identifier, not a score
    score_cols = scores.drop(columns=['doc_id'], errors='ignore').select_dtypes('number').columns
    means = scores[score_cols].mean()

    fig, ax = plt.subplots(figsize=(10, 5))
    means.plot(kind='bar', ax=ax)
    ax.set_ylabel('Average probability')
    ax.set_title('Average category scores across documents')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

## Adapting to Your Own Corpus

To apply this pipeline to a different corpus or topic:

1. **Replace the input documents** — put your text files in a directory and point `extract.input_dir` to it
2. **Adjust the config** — disable section extraction if your documents don't have sections; update `custom_stop_words` for your domain
3. **Run Stages 1-2** — this will discover communities specific to your corpus
4. **Label the communities** — open the CSV, review representatives, add your own `category` and `subcategory` labels
5. **Run Stages 3-4** — score your documents against your taxonomy
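
The steps above can also be driven from Python rather than from individual shell cells. The helper below is a small sketch that builds the same `python <script> --config <path>` commands used earlier in this notebook. The function names and the config path in the usage comment are hypothetical, and the code assumes it runs from the pipeline root.

```python
import subprocess
import sys

# Stage scripts as invoked earlier in this notebook.
STAGE_SCRIPTS = [
    "01_extract_text.py",
    "02_cluster_ngrams.py",
    "03_extract_doc_ngrams.py",
    "04_score_documents.py",
]


def stage_command(stage, config_path):
    """Build the argv list for one stage (1-4)."""
    if not 1 <= stage <= len(STAGE_SCRIPTS):
        raise ValueError(f"stage must be between 1 and {len(STAGE_SCRIPTS)}")
    return [sys.executable, STAGE_SCRIPTS[stage - 1], "--config", config_path]


def run_stages(stages, config_path):
    """Run the given stages in order, stopping at the first failure."""
    for stage in stages:
        subprocess.run(stage_command(stage, config_path), check=True)


# Usage (hypothetical config path):
# run_stages([1, 2], "examples/my_corpus/config_my_corpus.yaml")
# ... label the discovered communities by hand ...
# run_stages([3, 4], "examples/my_corpus/config_my_corpus.yaml")
```

Splitting the run in two keeps the manual labeling step between Stage 2 and Stage 3 explicit.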

The pipeline is domain-agnostic: once section extraction and stop words are configured for your documents, the same algorithms apply to news articles, scientific papers, legal documents, or any other text corpus.