# Semantic Clustering of Corporate Business Profiles Extracted from Form 10-K Reports

* **Clean Text**: Remove HTML tags, punctuation, special characters, and normalize whitespace.
* **Lowercase**: Convert all text to lowercase.
* **Stopword Removal**: Remove common English stopwords (e.g., “the”, “and”, “of”).
* **Tokenization**: Split text into words or meaningful n-grams.
* **Stemming/Lemmatization**: Reduce words to their root forms (optional).
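
The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's pipeline: `preprocess` and the tiny `STOPWORDS` set are hypothetical names introduced here, the tokenizer is a plain whitespace split, and lemmatization is left out because it needs a language model such as the spaCy pipeline loaded below.

```python
import html
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a
# fuller list such as the one shipped with spaCy or NLTK.
STOPWORDS = {"the", "and", "of", "a", "an", "is", "to", "in", "that", "so", "can"}


def preprocess(raw: str) -> list[str]:
    """Clean, lowercase, tokenize, and drop stopwords from a business description."""
    text = re.sub(r"<[^>]+>", " ", raw)        # strip HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = text.lower()                        # lowercase
    text = re.sub(r"[^a-z0-9\s-]", " ", text)  # drop punctuation, keep hyphens
    # Splitting on whitespace tokenizes and normalizes spacing in one step.
    tokens = (t.strip("-") for t in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]


sample = "<p>Asana is an enterprise work management   platform &amp; more.</p>"
print(preprocess(sample))
```

Keeping hyphens preserves compound terms such as "cross-functional" as single tokens, which may or may not be desirable depending on the downstream vectorizer.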

In [11]:
import spacy
from rich import print

In [20]:
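## python -m spacy download en_core_web_trf
## python -m spacy validate

# Load the English transformer pipeline.
# In recent spaCy releases it also needs the spacy-curated-transformers package.
# nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_trf")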

ValueError: [E002] Can't find factory for 'curated_transformer' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: merge_noun_chunks, merge_entities, merge_subtokens, en.lemmatizer
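
The error above says that no installed package has registered a `curated_transformer` factory. To my knowledge, recent `en_core_web_trf` releases get that factory from the `spacy-curated-transformers` package, so a mismatch between the spaCy version and the installed model, or a missing package, is a likely cause. Treat the module name `spacy_curated_transformers` as an assumption to verify against the installed model's requirements. The following sketch only checks whether the modules can be imported in the current environment; `has_module` is a helper introduced here for illustration.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


# Assumed module names; adjust if the installed model lists other requirements.
for module in ("spacy", "spacy_curated_transformers"):
    status = "installed" if has_module(module) else "missing"
    print(f"{module}: {status}")
```

If the second module is reported missing, installing it in the same environment as the notebook kernel and rerunning `python -m spacy validate` is a reasonable next step.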

In [8]:
text  = 'Asana is an enterprise work management software platform that unifies cross-functional teams so businesses can effectively set and track goals, drive strategic initiatives, and manage work effectively. Over 169,000 paying customers across 200 countries and territories use Asana to connect their work to company goals and orchestrate mission-critical workflows like product launches, employee onboarding, resource planning, tracking company-wide strategic initiatives and more. Our secure and scalable platform with AI-powered features adds structure to unstructured work, creating clarity, accountability, and impact for everyone within an organization—executives, department heads, team leads, and individuals. In Asana, everyone understands who is doing what, by when, how and why.'

In [9]:
# Process text
doc = nlp(text)

In [None]:
# Remove stopwords
# filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
# doc.to_json()
# print(filtered_tokens)

for token in doc:
    print(f"{token.text} → {token.lemma_}")


* **TF-IDF Vectorization**: Convert text into numerical vectors weighted by term frequency and inverse document frequency, which emphasizes words that are frequent in one document but rare across the corpus.
* **Word Embeddings**: Use pretrained models (e.g., Word2Vec, GloVe, or transformers like BERT) to create dense vector representations that capture semantic meaning.
* **Domain Keywords**: Optionally, use domain-specific dictionaries or extract key phrases (e.g., “software”, “cloud services”, “pharmaceuticals”).
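
To make the TF-IDF option above concrete, here is a small pure-Python sketch. It is an illustration under stated assumptions rather than the method used in this notebook: the three token lists are made-up toy profiles, `tfidf_vectors` and `cosine` are hypothetical helper names, and the IDF uses a smoothed form, `ln((1 + N) / (1 + df)) + 1`. In practice a library vectorizer would normally replace this.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, one TF-IDF vector per tokenized document)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency; rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)  # avoid dividing by zero for empty documents
        vectors.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, vectors


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy, made-up profiles: two software companies and one pharmaceutical company.
docs = [
    ["work", "management", "software", "platform"],
    ["cloud", "software", "services"],
    ["pharmaceuticals", "drug", "development"],
]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The two software profiles share a term and so have a positive similarity, while the software and pharmaceutical profiles share none and score zero. Pairwise similarities of this kind are what a clustering step would group on.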

In [3]:
import spacy
nlp = spacy.load("en_core_web_trf")

In [13]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

In [21]:
for token in doc:
    print(token.text, token.vector[:5])  # first 5 dims; empty here because en_core_web_trf has no static word vectors

Apple []
is []
looking []
at []
buying []
U.K. []
startup []
for []
$ []
1 []
billion []
. []


In [22]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In [26]:
doc._.trf_data

DocTransformerOutput(all_outputs=[Ragged(data=array([[-0.63482094, -0.233602  ,  0.12225183, ...,  0.57827973,
        -0.82340395,  0.57881105],
       [-0.4423008 ,  0.42749116, -0.7890119 , ..., -0.6864278 ,
        -0.2860822 , -0.6672599 ],
       [ 0.2509021 , -0.05691053,  0.30914536, ..., -0.06898394,
        -0.85904   ,  0.54899323],
       ...,
       [-0.28108495,  0.02199169, -0.26524004, ...,  1.0873584 ,
        -0.15650374,  0.82378274],
       [-0.10340248,  0.66952765, -0.7903226 , ...,  1.1352894 ,
        -0.8323419 ,  0.19945687],
       [-1.1713653 , -0.65315473, -0.63391745, ..., -0.0796144 ,
        -0.09865176, -0.10024022]], shape=(15, 768), dtype=float32), lengths=array([1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1], dtype=int32), data_shape=(-1, 768), starts_ends=None)], last_layer_only=True)
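
The output above holds 15 piece vectors of width 768, and its `lengths` array sums to 15 across the 12 spaCy tokens, with `U.K.` split into 4 pieces. One common way to get a single vector per token is to mean-pool the pieces that belong to it. The sketch below does this in plain Python on toy 2-dimensional data. `pool_pieces` is a hypothetical helper. Passing it `ragged.data` and `ragged.lengths` with `ragged = doc._.trf_data.all_outputs[-1]` is an assumption read off the printed representation, not a documented recipe.

```python
def pool_pieces(piece_vectors, lengths):
    """Mean-pool consecutive wordpiece vectors into one vector per token."""
    if sum(lengths) != len(piece_vectors):
        raise ValueError("lengths must sum to the number of piece vectors")
    pooled, start = [], 0
    for n in lengths:
        group = piece_vectors[start:start + n]
        # Average each dimension over the pieces that make up this token.
        pooled.append([sum(dim) / n for dim in zip(*group)])
        start += n
    return pooled


# Toy data: 4 pieces with 2 dimensions each; the second token spans 2 pieces.
pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
token_vectors = pool_pieces(pieces, [1, 2, 1])
print(token_vectors)
```

Averaging the token vectors again would give one document-level embedding per business profile, which could then feed a clustering algorithm.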

## References

- [spaCy](https://spacy.io/usage/spacy-101)
- [Asana Inc Form 10-K](https://www.sec.gov/Archives/edgar/data/1477720/000147772025000045/asan-20250131.htm)