# Functional Company Search with SBERT and Cosine Similarity

This project implements a natural language processing pipeline for **retrieving the most functionally similar companies** from a set of business descriptions using **SBERT embeddings** and **cosine similarity analysis**.

The goal is to input a **functional prompt** (e.g., "provides healthcare consulting" or "manufactures car parts") and dynamically retrieve companies whose business descriptions best match the functional requirement. The model captures the semantic meaning of both the descriptions and the prompt to perform an accurate search.

<img src=../Images/image.png alt="" width="300">

**Objectives:**

- Extract and preprocess company business descriptions
- Generate sentence embeddings for the business descriptions using SBERT
- Encode user functional prompts into the same embedding space
- Calculate cosine similarity scores between prompts and company descriptions
- Retrieve and rank the top-matching companies based on semantic similarity

## 1. Setup and Imports

Import libraries, configure settings, and prepare the environment for data retrieval and processing.


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
import nltk
import spacy
import re
import os
import yfinance as yf
from nltk.tokenize import sent_tokenize
from transformers import AutoTokenizer, AutoModel

# Download required NLTK data
nltk.download("punkt", quiet=True)

# Set random seed
np.random.seed(42) 

#Suppress warnings
import warnings
warnings.filterwarnings('ignore')



In [20]:
# Define the base directory for data
base_dir = os.path.dirname(os.path.abspath("."))

# Define subdirectories
data_dir = os.path.join(base_dir, "Data")
embeddings_dir = os.path.join(data_dir, "Embeddings")

# Create directories if they don't exist
os.makedirs(data_dir, exist_ok=True)
os.makedirs(embeddings_dir, exist_ok=True)

# Define paths to save outputs
company_bd_data_path = os.path.join(data_dir, "company_bd_data.csv")
similarity_results_path = os.path.join(data_dir, "similarity_results.csv")
embedding_save_path = os.path.join(embeddings_dir, "bd_embeddings.pt")


## 2. Data Collection

I use two main sources to gather business descriptions for public companies:

1. **Wikipedia**: To extract the current list of Russell 1000 companies, their ticker symbols, and industry classifications
2. **Yahoo Finance API (via yfinance)**: To obtain detailed business descriptions for each company

After retrieval, I perform basic data cleaning on the ticker symbols to ensure compatibility with the Yahoo Finance API format (replacing periods with dashes).
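
The ticker cleanup described above amounts to a single string substitution. A minimal sketch, where `normalize_ticker` is a hypothetical helper name introduced here for illustration (it is not part of yfinance):

```python
def normalize_ticker(symbol: str) -> str:
    """Convert a class-share ticker such as 'BRK.B' to the dash form 'BRK-B'."""
    return symbol.strip().upper().replace(".", "-")


# Example symbols; only the ones containing a period are changed
raw_symbols = ["BRK.B", "BF.A", "AAPL"]
print([normalize_ticker(s) for s in raw_symbols])
```

If the descriptions are later joined back onto the Wikipedia table by symbol, the table's symbol column needs the same normalization, otherwise class-share tickers will not match.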

In [4]:
# Extract Russell 1000 tickers from Wikipedia
url = "https://en.wikipedia.org/wiki/Russell_1000_Index"
russell1000_table = pd.read_html(url)[3] 
tickers = russell1000_table['Symbol'].tolist()

# Replace periods with dashes (e.g., BRK.B -> BRK-B) as yfinance requires this format
tickers = [ticker.replace('.', '-') for ticker in tickers]

# Use yfinance to extract business descriptions for each ticker
bd_data = []
errors = []
for ticker in tickers:
    try:
        stock = yf.Ticker(ticker)
        info = stock.info
        description = info.get('longBusinessSummary', '') 
        bd_data.append({'Symbol': ticker, 'long_bd': description})
    except Exception as e:
        errors.append({'Symbol': ticker, 'error': str(e)})

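# Normalize symbols in the Wikipedia table the same way so the merge keys match
russell1000_table['Symbol'] = russell1000_table['Symbol'].str.replace('.', '-', regex=False)

# Add business description data to the Russell 1000 table
df = pd.merge(russell1000_table, pd.DataFrame(bd_data), on='Symbol', how='left')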

# Print number of errors
print(f"Number of errors: {len(errors)}")

Number of errors: 578


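Note: The Yahoo Finance API is rate-limited, which can cause lookups to fail and is the likely source of most of the errors counted above. Tickers whose lookup fails have no description and are dropped in the next step.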
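
One common way to soften rate-limit failures is to retry each lookup with exponential backoff. The sketch below is an assumption about how that could look, not something this notebook ran: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the fetcher is a stand-in rather than a real yfinance call.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], str],
    ticker: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(ticker), retrying with exponential backoff on any exception.

    Returns the fetched value, or None if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(ticker)
        except Exception:
            if attempt < max_attempts - 1:
                # Wait base_delay, then 2x, then 4x ... before the next attempt
                sleep(base_delay * 2 ** attempt)
    return None


# Demo with a stand-in fetcher that fails twice before succeeding
calls = {"n": 0}


def flaky_fetch(ticker: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return f"description for {ticker}"


# The sleep function is replaced with a no-op so the demo runs instantly
result = fetch_with_retry(flaky_fetch, "MMM", sleep=lambda seconds: None)
print(result)
```

In the collection loop, `fetch` could wrap the `stock.info.get('longBusinessSummary', '')` lookup. Whether retries actually recover the failed tickers depends on how Yahoo throttles requests, which this sketch does not verify.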

In [5]:
# Drop rows with missing business descriptions
df = df[df['long_bd'].notna()]

# Clean up column names for consistency
df.columns = df.columns.str.lower()
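# regex=False treats "." as a literal character regardless of pandas version
df.columns = df.columns.str.replace(" ", "_", regex=False)
df.columns = df.columns.str.replace(".", "_", regex=False)
df.columns = df.columns.str.replace("-", "_", regex=False)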

print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 424 entries, 0 to 427
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   company            424 non-null    object
 1   symbol             424 non-null    object
 2   gics_sector        424 non-null    object
 3   gics_sub_industry  342 non-null    object
 4   long_bd            424 non-null    object
dtypes: object(5)
memory usage: 19.9+ KB
None


In [6]:
# Display top rows 
df.head()

| | company | symbol | gics_sector | gics_sub_industry | long_bd |
|---|---|---|---|---|---|
| 0 | 10x Genomics | TXG | Health Care | Health Care Technology | 10x Genomics, Inc., a life science technology ... |
| 1 | 3M | MMM | Industrials | Industrial Conglomerates | 3M Company provides diversified technology ser... |
| 2 | A. O. Smith | AOS | Industrials | Building Products | A. O. Smith Corporation manufactures and marke... |
| 3 | AAON | AAON | Industrials | Building Products | AAON, Inc., together with its subsidiaries, en... |
| 4 | Abbott Laboratories | ABT | Health Care | Health Care Equipment | Abbott Laboratories, together with its subsidi... |


In [7]:
# Save company business description data
df.to_csv(company_bd_data_path, index=False)

## 3. Text Cleaning and Functional Filtering

I preprocess business descriptions to focus only on sentences that describe functional operations, removing generic or non-business information:

- Clean text by removing extra whitespace.
- Split descriptions into individual sentences.
- Expand the dataset to one row per sentence.
- Filter out sentences about locations, history, or organizational details.
- Keep only sentences that capture actual business activities.


In [8]:
# Clean text function
def clean_text(text: str) -> str:
    """Remove extra whitespace while preserving case."""
    return re.sub(r'\s+', ' ', text).strip()

# Copy original dataframe and preprocess text
df_cleaned = df.copy()
df_cleaned["clean_long_bd"] = df_cleaned["long_bd"].apply(clean_text)
df_cleaned["sentences"] = df_cleaned["clean_long_bd"].apply(sent_tokenize)

# Expand each description into individual sentences
df_expanded = df_cleaned.explode("sentences").rename(
    columns={"sentences": "individual_sentence"}
).reset_index(drop=True)

print(f"Processed {len(df_cleaned)} companies into {len(df_expanded)} total sentences.")


Processed 424 companies into 2800 total sentences.


In [9]:
# Load Spacy model
nlp = spacy.load("en_core_web_sm")

# List of phrases indicating non-functional sentences
NON_FUNCTION_KEYWORDS = {
    "headquartered in", "located in", "offices in", "based in", "established in",
    "founded in", "incorporated in", "previously known as", "formerly known as",
    "renamed to", "was founded"
}

def describes_non_function(sentence: str) -> bool:
    """Check if a sentence contains non-functional keywords (e.g., locations, history)."""
    return any(phrase in sentence.lower() for phrase in NON_FUNCTION_KEYWORDS)

def filter_functional_sentences(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Remove non-functional sentences from the expanded dataframe.
    """
    mask = df["individual_sentence"].apply(
        lambda x: not describes_non_function(x)
    )
    df_filtered = df[mask].reset_index(drop=True)
    df_removed = df[~mask].reset_index(drop=True)
    return df_filtered, df_removed

In [10]:
# Apply the filtering
df_filtered, df_removed = filter_functional_sentences(df_expanded)

# Print stats
print(f"Number of non-functional sentences removed: {len(df_removed)}")

Number of non-functional sentences removed: 536


In [11]:
# Print some of the removed sentences
print("\nExamples of removed non-functional sentences:")
print(df_removed["individual_sentence"][0:10])


Examples of removed non-functional sentences:
0    The company was formerly known as 10X Technolo...
1    10x Genomics, Inc. was incorporated in 2012 an...
2    3M Company was founded in 1902 and is headquar...
3    A. O. Smith Corporation was founded in 1874 an...
4    AAON, Inc. was incorporated in 1987 and is hea...
5    Abbott Laboratories was founded in 1888 and is...
6    The company was incorporated in 2012 and is he...
7    Acadia Healthcare Company, Inc. was founded in...
8    The company was founded in 1951 and is based i...
9    Acuity Inc. was formerly known as Acuity Brand...
Name: individual_sentence, dtype: object


## 4. Sentence Embedding Generation

In this step, I generate high-quality sentence embeddings for the cleaned business descriptions by:

- Loading **multi-qa-mpnet-base-cos-v1**, a Sentence-BERT (SBERT) model.
- This model was trained for semantic search and question-answer retrieval, which makes it a good fit for matching short functional prompts against business description sentences.
- Encoding each functional sentence into a dense vector representation in a shared semantic space.
- Using batch processing for efficient handling of large datasets.
- Saving the generated embeddings for faster reuse in future steps.


In [63]:
# SBERT Model
SBERT_MODEL = "sentence-transformers/multi-qa-mpnet-base-cos-v1"
BATCH_SIZE = 64  # Batch size for encoding

# Load SBERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(SBERT_MODEL)
model = AutoModel.from_pretrained(SBERT_MODEL)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def encode_sentences(sentences: list, batch_size: int = BATCH_SIZE) -> torch.Tensor:
    """
    Generate SBERT embeddings for a list of sentences, using batching for efficiency.
    
    Args:
        sentences: List of input sentences.
        batch_size: Number of sentences to encode at once.
    
    Returns:
        Tensor of embeddings corresponding to each sentence.
    """
    embeddings = []
    
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = model(**inputs)
            batch_embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token
            embeddings.append(batch_embeddings.cpu())  # Move to CPU to save GPU memory
    
    return torch.cat(embeddings, dim=0)

# Prepare sentences for embedding
bd_sentences = df_filtered["individual_sentence"].tolist()

# Generate embeddings
bd_embeddings = encode_sentences(bd_sentences)

print(f"Generated embeddings for {len(bd_sentences)} sentences.")


Generated embeddings for 2264 sentences.
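
The encoder above keeps only the first-token (CLS) vector. As far as I can tell from the model card, `multi-qa-mpnet-base-cos-v1` is meant to be used with mean pooling over tokens followed by L2 normalization, so the CLS vectors may not be the embeddings the model was tuned to produce. The sketch below illustrates masked mean pooling on plain Python lists so that it runs without a model download; `mean_pool` and `l2_normalize` are hypothetical helper names and the toy vectors are made up.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-padding input
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector


# Toy "sentence": three real tokens plus one padding token, 2-dimensional vectors
tokens = [[1.0, 0.0], [3.0, 4.0], [2.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = mean_pool(tokens, mask)     # the padding vector is ignored
embedding = l2_normalize(pooled)     # unit-length sentence embedding
print(pooled, embedding)
```

In the torch code above, the equivalent would weight `outputs.last_hidden_state` by `inputs["attention_mask"]` before averaging over the token dimension. Switching the pooling method would change every similarity score, so the results reported in Section 5 correspond to the CLS variant.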


In [64]:
# Save embeddings
torch.save(bd_embeddings, embedding_save_path)
print("Embeddings saved.")

Embeddings saved.


## 5. Prompt-Based Similarity Search

In this step, I perform a semantic search to retrieve the companies most aligned with a functional prompt (e.g., "provides healthcare consulting") by:

- Encoding the functional prompt into a dense vector using SBERT embeddings.
- Computing cosine similarity between the prompt and all business description sentence embeddings.
- Grouping matched sentences by company and calculating the average similarity score per company.
- Sorting and returning the Top-N companies with the highest functional similarity to the prompt.
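
To make the ranking logic concrete, here is a small pure-Python sketch of the same steps on made-up 2-dimensional vectors. The company names and vectors are invented for illustration, and `rank_companies` is a hypothetical helper, not the function used in this notebook.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_companies(
    query_vec: List[float],
    sentence_vecs: List[Tuple[str, List[float]]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Average per-sentence cosine scores by company and return the top_n."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for company, vec in sentence_vecs:
        scores[company].append(cosine(query_vec, vec))
    averages = [(company, sum(s) / len(s)) for company, s in scores.items()]
    return sorted(averages, key=lambda item: item[1], reverse=True)[:top_n]


query = [1.0, 0.0]
sentences = [
    ("Alpha", [1.0, 0.0]),  # same direction as the query
    ("Alpha", [0.0, 1.0]),  # orthogonal to the query
    ("Beta", [1.0, 1.0]),   # 45 degrees from the query
]
ranking = rank_companies(query, sentences, top_n=2)
print(ranking)
```

Here Beta outranks Alpha even though Alpha contains the single best-matching sentence, because Alpha's unrelated sentence pulls its average down. Taking the maximum, or the mean of the top few sentence scores per company, would be an alternative aggregation worth comparing.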


In [87]:
def find_top_n_similar_companies_grouped(
    query: str,
    bd_embeddings: torch.Tensor,
    df_filtered: pd.DataFrame,
    top_n: int = 5
) -> pd.DataFrame:
    """
    Find Top-N companies functionally similar to a given query,
    grouped by company with lists of matched sentences and similarity scores.

    Args:
        query: Functional prompt (e.g., "Manufactures automobile parts.")
        bd_embeddings: Precomputed embeddings of business description sentences.
        df_filtered: DataFrame containing company names, descriptions, and individual sentences.
        top_n: Number of top companies to return (default: 5).

    Returns:
        DataFrame where each row represents a company with:
        - Company name
        - Business description
        - List of similar sentences
        - List of similarity scores
        - Average similarity score
    """
    # Encode the query
    query_embedding = encode_sentences([query])

    # Compute cosine similarity
    cosine_scores = F.cosine_similarity(query_embedding, bd_embeddings)

    # Collect all matches without applying a threshold
    match_records = [
        {
            "company_name": df_filtered.iloc[idx]["company"],
            "business_description": df_filtered.iloc[idx]["long_bd"],
            "similar_sentence": df_filtered.iloc[idx]["individual_sentence"],
            "similarity_score": round(score.item(), 4)
        }
        for idx, score in enumerate(cosine_scores)
    ]

    # Create a temporary DataFrame
    temp_df = pd.DataFrame(match_records)

    # Group by company and aggregate sentences and scores into lists
    grouped = temp_df.groupby(["company_name", "business_description"]).agg({
        "similar_sentence": list,
        "similarity_score": list
    }).reset_index()

    # Calculate average similarity per company
    grouped["average_similarity"] = grouped["similarity_score"].apply(lambda scores: sum(scores) / len(scores))

    # Sort by average similarity and take top_n
    top_companies = grouped.sort_values(by="average_similarity", ascending=False).head(top_n).reset_index(drop=True)

    return top_companies

In [88]:
# Similarity search - example usage
query = "Provides cybersecurity services."
similarity_results_df = find_top_n_similar_companies_grouped(query, bd_embeddings, df_filtered, top_n=5)

print(f"Top {len(similarity_results_df)} companies matching the query:")
similarity_results_df.head()


Top 5 companies matching the query:


| | company_name | business_description | similar_sentence | similarity_score | average_similarity |
|---|---|---|---|---|---|
| 0 | CrowdStrike | CrowdStrike Holdings, Inc. provides cybersecur... | [CrowdStrike Holdings, Inc. provides cybersecu... | [0.4261, 0.3993, 0.6595, 0.5276] | 0.5031 |
| 1 | CDW | CDW Corporation provides information technolog... | [CDW Corporation provides information technolo... | [0.4202, 0.4142, 0.5439, 0.4883, 0.5479, 0.5966] | 0.5019 |
| 2 | Amdocs | Amdocs Limited, through its subsidiaries, prov... | [Amdocs Limited, through its subsidiaries, pro... | [0.2483, 0.5449, 0.4429, 0.638, 0.5773] | 0.4903 |
| 3 | Cognizant | Cognizant Technology Solutions Corporation, a ... | [Cognizant Technology Solutions Corporation, a... | [0.3989, 0.4208, 0.6492, 0.4784, 0.4188, 0.5332] | 0.4832 |
| 4 | Fidelity National Information Services | Fidelity National Information Services, Inc. e... | [Fidelity National Information Services, Inc. ... | [0.2981, 0.4926, 0.5754, 0.5283] | 0.4736 |


In [89]:
# Save results
similarity_results_df.to_csv(similarity_results_path, index=False)

## 6. Conclusion

This project successfully implemented a prompt-based semantic search pipeline to retrieve functionally similar companies based on business descriptions.  
By leveraging SBERT embeddings and cosine similarity analysis, the project is able to:

- Capture the functional meaning behind both company descriptions and user prompts.
- Group and rank companies based on the strength of their semantic similarity.

To further enhance the system, pre-filtering the business description dataset by sector classification and financial metrics could:

- Reduce computational overhead by shrinking the set of sentences that must be scored.
- Improve retrieval quality by focusing the search on more contextually relevant companies.
- Enable more targeted functional comparisons in specialized domains.
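
As a rough illustration of the sector pre-filter idea, the sketch below restricts the candidate sentences to one sector before any scoring takes place. The records and the `filter_by_sector` helper are hypothetical; in this notebook the equivalent would be a boolean mask on the `gics_sector` column, applied to both `df_filtered` and the matching rows of `bd_embeddings`.

```python
from typing import Dict, List


def filter_by_sector(records: List[Dict[str, str]], sector: str) -> List[int]:
    """Return the positions of records whose sector matches (case-insensitive)."""
    wanted = sector.strip().lower()
    return [i for i, rec in enumerate(records) if rec["gics_sector"].lower() == wanted]


# Made-up sentence records for illustration
records = [
    {"company": "Alpha", "gics_sector": "Health Care"},
    {"company": "Beta", "gics_sector": "Industrials"},
    {"company": "Gamma", "gics_sector": "Health Care"},
]

keep = filter_by_sector(records, "health care")
print(keep)
```

The returned positions can index both the sentence table and the embedding matrix, which keeps the two aligned after filtering.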
