# *Semantic Search Query Model*
## Cailean Bushnell 

### **Overview**

The objective of this project is to develop a semantic search system that enhances the relevance of database query results, enabling ASPIRE customers to more effectively locate desired content.

After considerable research and experimentation, we determined that integrating a pre-trained semantic model into our user interface would be the most feasible approach. Existing off-the-shelf search indexes were either incompatible with our database structure or exceeded our computational constraints.

## *Retrieve & Re-Rank Model*

### **Model Selection**

We selected the `multi-qa-MiniLM-L6-cos-v1` model, which is optimized for semantic search and available via HuggingFace. Given a user-provided query, this model can identify relevant documents, passages, or articles. It was trained on 215 million (question, answer) pairs sourced from platforms such as Bing, Google Search, and Quora. The model supports a maximum sequence length of 512 tokens and uses mean pooling to generate embeddings; in this notebook we truncate passages to 256 tokens.
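To make the mean pooling step concrete, here is a minimal sketch in plain Python. It is an illustration of the idea rather than the model's actual code: the function name `mean_pool` and the toy vectors are assumptions, and real token embeddings have hundreds of dimensions.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Three token positions; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

The mask matters because passages in a batch are padded to a common length, and padding vectors would otherwise distort the average.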

The retrieve-and-re-rank architecture substantially improves semantic search performance by enabling a two-stage filtering process: initial retrieval of potentially relevant articles, followed by refined ranking. This framework is particularly effective for addressing complex or ambiguous queries, allowing the system to surface the most contextually appropriate ASPIRE articles.

### **Model Pipeline**

The semantic search process proceeds through the following pipeline:

1. **Query Input:** The user enters a search query.
2. **Bi-Encoder Retrieval:** The query is encoded and passed through the bi-encoder, which searches the document corpus and retrieves the top 10 semantically relevant articles.
3. **Cross-Encoder Re-Ranking:** The retrieved articles are then passed through the cross-encoder, which re-ranks them based on a deeper semantic comparison with the original query.
4. **Output:** A ranked list of results is returned and presented to the user.
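The four steps above can be sketched without any model at all. In the example below, `cheap_score` stands in for the bi-encoder and `precise_score` for the cross-encoder; both toy scorers (`word_overlap` and `jaccard`) are hypothetical placeholders chosen only so the sketch runs on its own.

```python
from typing import Callable, List, Tuple


def retrieve_and_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (corpus index, precise score) pairs, best first."""
    # Stage 1: score every document with the cheap scorer and keep the top_k.
    ranked = sorted(range(len(corpus)), key=lambda i: cheap_score(query, corpus[i]), reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: re-score only the candidates with the expensive scorer.
    rescored = [(i, precise_score(query, corpus[i])) for i in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in scorers: shared-word count and Jaccard overlap.
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = [
    "electric vehicles and the power grid in cities and towns",
    "electric vehicles",
    "share buyback program",
]
results = retrieve_and_rerank("electric vehicles", docs, word_overlap, jaccard, top_k=2)
print(results)  # document 1 is ranked ahead of document 0; document 2 is never re-scored
```

The design point is that the expensive scorer only ever sees `top_k` documents, so its cost does not grow with the size of the corpus.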

### **Technical Architecture**

Our retrieval system combines both lexical and vector-based search techniques. Lexical search captures exact textual matches but fails to account for synonyms, acronyms, or spelling variations. To overcome these limitations, we use a bi-encoder to transform both queries and articles into dense vector representations, enabling more flexible and context-aware retrieval via similarity measures such as dot product, cosine similarity, or Euclidean distance.
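The three similarity measures named above can be written in a few lines of plain Python. This is a sketch on made-up two-dimensional vectors; the helper names are assumptions. It also checks a useful identity: for unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine`, so all three measures produce the same ranking.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy unit-length vectors (illustrative values, not real embeddings).
query = [1.0, 0.0]
close = [0.8, 0.6]
far = [0.0, 1.0]

for name, doc in [("close", close), ("far", far)]:
    print(name, dot(query, doc), cosine_similarity(query, doc), euclidean_distance(query, doc))
```

Higher dot product and cosine similarity mean a better match, while a lower Euclidean distance means a better match.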

Tools such as Elasticsearch and Approximate Nearest Neighbor (ANN) indexes can also be used for dense vector retrieval, particularly on large corpora. We opted for exact similarity search over the bi-encoder embeddings because of its simplicity and effectiveness within our project timeline.

While bi-encoders are efficient and scalable, they may occasionally return irrelevant results. To address this, we incorporated a cross-encoder for fine-grained semantic evaluation. Specifically, we utilized the `cross-encoder/ms-marco-MiniLM-L-6-v2` model, which jointly processes the query and each candidate article through a transformer network and outputs a relevance score between 0 and 1. This mechanism significantly improves performance, especially when dealing with domains not explicitly represented during the bi-encoder’s training phase.
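A bounded score like the one described above is typically obtained by passing a model's raw output (a logit) through a logistic sigmoid. The sketch below shows that mapping on made-up logits. Whether the library applies the sigmoid for this particular model depends on its version and configuration, so treat that step as an assumption rather than a description of our setup.

```python
import math


def sigmoid(logit: float) -> float:
    """Map any real-valued logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical raw outputs for three candidate articles.
raw_logits = {"doc_a": 2.0, "doc_b": 0.0, "doc_c": -3.0}
scores = {doc: sigmoid(logit) for doc, logit in raw_logits.items()}

# The sigmoid is monotonic, so ranking by score matches ranking by logit.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Because the mapping preserves order, the re-ranked list is the same with or without it; only the displayed score changes.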

Importantly, our implementation does not involve fine-tuning. We concluded that the model's pre-training on a broad range of data was sufficient given the nature of our corpus. Additionally, the absence of clearly defined "correct" answers for user queries made supervised fine-tuning infeasible within our timeframe.

### **Part 1: Installing Dependencies and Preprocessing Data**

To evaluate our search model, we compiled a CSV file containing scraped articles from investment-related websites. These articles serve as our test corpus for determining the relevance of returned results.

The first step involves installing the required dependencies and converting the scraped content into a structured DataFrame.


In [1]:
#installing sentence-transformers (bi-encoder and cross-encoder) and rank_bm25 (lexical search)
!pip install -U sentence-transformers rank_bm25

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
#importing pandas for dataframe usage
import pandas as pd

In [3]:
df_all = pd.read_csv("aspire_articles.csv")

In [4]:
df_all.head()

| | article_url | company_name | article_title | article_description | article_body | article_date |
|---|---|---|---|---|---|---|
| 0 | https://global.abb/news/detail/102388/abb-plan... | ABB Inc. | ABB plans to delist ADRs from NYSE | | ABB is planning to delist its American Deposit... | 2025-04-23 00:00:00 |
| 1 | https://global.abb/news/detail/102394/q1-2023-... | ABB Inc. | Q1 2023 results | | Ad hoc Announcement pursuant to Art. 53 Listin... | 2025-04-23 00:00:00 |
| 2 | https://global.abb/news/detail/102184/abb-form... | ABB Inc. | ABB Formula E to showcase e-mobility excellenc... | | ABB FIA Formula E World Championship returns t... | 2020-04-23 00:00:00 |
| 3 | https://global.abb/news/detail/101645/abb-inve... | ABB Inc. | ABB invests $170 million in the US | | Investment reflects increased customer demand ... | 2004-04-23 00:00:00 |
| 4 | https://global.abb/news/detail/101540/abb-laun... | ABB Inc. | ABB launches new share buyback program of up t... | | On April 3, 2023, ABB will launch its previous... | 2031-03-23 00:00:00 |


In [26]:
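#creating a dataframe with our own ASPIRE articles to test our model on
articles = pd.read_csv("aspire_articles.csv")
#dropping duplicate URLs and resetting the index so row labels match row positions
#(the search function later looks articles up by their position in the corpus)
articles = articles.drop_duplicates(subset='article_url').reset_index(drop=True)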
articles.head()


In [7]:
#keeping only the text columns we need; .copy() gives an independent dataframe
#so that adding a column later does not trigger a SettingWithCopyWarning
article_df = articles[['article_body', 'article_description', 'article_title']].copy()
article_df.head()

| | article_body | article_description | article_title |
|---|---|---|---|
| 0 | ABB is planning to delist its American Deposit... | | ABB plans to delist ADRs from NYSE |
| 1 | Ad hoc Announcement pursuant to Art. 53 Listin... | | Q1 2023 results |
| 2 | ABB FIA Formula E World Championship returns t... | | ABB Formula E to showcase e-mobility excellenc... |
| 3 | Investment reflects increased customer demand ... | | ABB invests $170 million in the US |
| 4 | On April 3, 2023, ABB will launch its previous... | | ABB launches new share buyback program of up t... |


In [8]:
#importing dependencies
import numpy as np

In [9]:
#Creating a column in the dataframe that combines the title, description, and body of the article
#(non-string values such as NaN are skipped); this gives us all the important information when looking for the right article.
text_columns = ['article_title', 'article_description', 'article_body']
article_df['combined_articles'] = article_df[text_columns].apply(
    lambda row: ' '.join(value for value in row if isinstance(value, str)), axis=1)
article_df.head()


| | article_body | article_description | article_title | combined_articles |
|---|---|---|---|---|
| 0 | ABB is planning to delist its American Deposit... | | ABB plans to delist ADRs from NYSE | ABB plans to delist ADRs from NYSE ABB is plan... |
| 1 | Ad hoc Announcement pursuant to Art. 53 Listin... | | Q1 2023 results | Q1 2023 results Ad hoc Announcement pursuant t... |
| 2 | ABB FIA Formula E World Championship returns t... | | ABB Formula E to showcase e-mobility excellenc... | ABB Formula E to showcase e-mobility excellenc... |
| 3 | Investment reflects increased customer demand ... | | ABB invests $170 million in the US | ABB invests $170 million in the US Investment ... |
| 4 | On April 3, 2023, ABB will launch its previous... | | ABB launches new share buyback program of up t... | ABB launches new share buyback program of up t... |


In [10]:
#creating a dataframe with just the one important column
combined_articles = article_df[['combined_articles']]
combined_articles.head()

| | combined_articles |
|---|---|
| 0 | ABB plans to delist ADRs from NYSE ABB is plan... |
| 1 | Q1 2023 results Ad hoc Announcement pursuant t... |
| 2 | ABB Formula E to showcase e-mobility excellenc... |
| 3 | ABB invests $170 million in the US Investment ... |
| 4 | ABB launches new share buyback program of up t... |


In [11]:
#checking for duplicate combined texts; note that `deduped` is not used below,
#the models are run on `combined_articles`
deduped = combined_articles.drop_duplicates()

In [12]:
combined_articles.shape

(1775, 1)

### **Part 2: Creating the Bi-Encoder and Cross-Encoder and Preprocessing the Data**

With the data prepared, the next step is to load the two models and encode the corpus into vectors. Because both models are pre-trained, we apply them directly to the `combined_articles` text without any further training.

In [14]:
pip install cuda-python

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


In [26]:
!pip install nvidia-pyindex
!pip install nvidia-cudnn

Collecting nvidia-pyindex
  Downloading nvidia-pyindex-1.0.9.tar.gz (10 kB)
Successfully built nvidia-pyindex
Installing collected packages: nvidia-pyindex
Successfully installed nvidia-pyindex-1.0.9
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting nvidia-cudnn
  Downloading nvidia-cudnn-0.0.1.dev5.tar.gz (7.9 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'

  error: subprocess-exited-with-error

  python setup.py egg_info did not run successfully.
  exit code: 1

  RuntimeError:
  The package you are trying to install is only a placeholder project on PyPI.org repository.
  This package is hosted on NVIDIA Python Package Index.


In [22]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

# bi_encoder = torch.load("C:\\Users\\cobus\\OneDrive\\Documents\\GitHub\\hackusu\\bi_encoder", map_location= torch.device('cpu'))
# cross_encoder = torch.load("C:\\Users\\cobus\\OneDrive\\Documents\\GitHub\\hackusu\\cross_encoder", map_location= torch.device('cpu'))


if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


# Use the bi-encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# Truncate long passages to 256 tokens
bi_encoder.max_seq_length = 256
# Number of passages we want to retrieve with the bi-encoder
top_k = 10

# The bi-encoder retrieves the top_k (here 10) passages. We use a cross-encoder to re-rank that list and improve the quality of the results
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Plain list of strings, so list positions line up with the rows of corpus_embeddings
passages = combined_articles['combined_articles'].tolist()


print("Passages:", len(passages))

# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


Passages: 1775


Batches:   0%|          | 0/56 [00:00<?, ?it/s]
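The progress bar above reports 56 batches. That figure is consistent with 1,775 passages and a batch size of 32, which we assume here is the default `batch_size` of `SentenceTransformer.encode` since the call above does not set one. A quick check of the arithmetic:

```python
import math

num_passages = 1775   # from the "Passages: 1775" output above
batch_size = 32       # assumption: library default, not set explicitly in this notebook

# The last batch is only partially filled, so we round up.
num_batches = math.ceil(num_passages / batch_size)
last_batch_size = num_passages - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)  # 56 15
```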

In [1]:
# print(corpus_embeddings)

In [None]:
#zip bi-encoder model to be used in the future
# !zip -r bi_encoder.zip bi_encoder

In [None]:
#save cross-encoder model
# cross_encoder.save('cross_encoder')

In [None]:
#zip the cross-encoder model
# !zip -r cross_encoder.zip cross_encoder

  adding: cross_encoder/ (stored 0%)
  adding: cross_encoder/config.json (deflated 47%)
  adding: cross_encoder/pytorch_model.bin (deflated 9%)
  adding: cross_encoder/special_tokens_map.json (deflated 42%)
  adding: cross_encoder/tokenizer_config.json (deflated 45%)
  adding: cross_encoder/tokenizer.json (deflated 71%)
  adding: cross_encoder/vocab.txt (deflated 53%)


### **Part 3: Testing the Model on a Custom Dataset**

To evaluate the model's effectiveness in our target use case, we compiled a dataset of approximately 2,500 to 3,000 articles scraped from ASPIRE investor websites. These articles were consolidated into a CSV file and preprocessed into a DataFrame. The run shown in this notebook uses the 1,775 articles in the DataFrame after dropping duplicate URLs.

Because both the bi-encoder and cross-encoder were pre-trained on external datasets, we added a lexical (keyword-based) search component, BM25, to complement the semantic retrieval process. The lexical search is intended to serve as an initial filter, ensuring that results contain explicit keyword matches relevant to the user's query. Note that the `search` function in this notebook computes the BM25 hits alongside the semantic hits and does not yet use them to filter the candidates passed to the cross-encoder.

A hybrid approach that combines keyword filtering with semantic re-ranking is meant to improve both the accuracy and the efficiency of the retrieval pipeline. Discarding obviously irrelevant articles early reduces computational overhead and keeps the user interface responsive.

Narrowing the search space before semantic re-ranking also helps the final output capture contextual meaning while preserving literal relevance, which makes the ASPIRE query system more useful to its users.
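The filter-then-re-rank idea described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: `keyword_prefilter`, `hybrid_search`, and `toy_semantic_score` are hypothetical names, and the toy scorer only stands in for a real semantic model. The sketch also falls back to the full corpus when no document shares a keyword with the query, which is one way to keep synonym-style queries from returning nothing.

```python
from typing import Callable, List


def keyword_prefilter(query: str, corpus: List[str]) -> List[int]:
    """Indices of documents that share at least one word with the query."""
    terms = set(query.lower().split())
    return [i for i, doc in enumerate(corpus) if terms & set(doc.lower().split())]


def hybrid_search(
    query: str,
    corpus: List[str],
    semantic_score: Callable[[str, str], float],
    top_k: int = 10,
) -> List[int]:
    candidates = keyword_prefilter(query, corpus)
    if not candidates:
        # No literal match: fall back to scoring the whole corpus.
        candidates = list(range(len(corpus)))
    ranked = sorted(candidates, key=lambda i: semantic_score(query, corpus[i]), reverse=True)
    return ranked[:top_k]


# Hypothetical scorer: prefers documents whose length is close to the query's.
def toy_semantic_score(query: str, doc: str) -> float:
    return 1.0 / (1 + abs(len(doc.split()) - len(query.split())))


corpus = [
    "electric trucks reach a tipping point",
    "share buyback program announced",
    "electric roadways",
]
print(hybrid_search("electric roadways", corpus, toy_semantic_score))  # [2, 0]
print(hybrid_search("hydraulics", corpus, toy_semantic_score))         # [2, 1, 0]
```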

In [23]:
# We also compare the results to lexical search (keyword search). Here, we use 
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
# ENGLISH_STOP_WORDS is the public import path for scikit-learn's stop-word list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from tqdm.autonotebook import tqdm
import numpy as np


# We lower case our text, strip punctuation, and remove stop-words from indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)


  0%|          | 0/1775 [00:00<?, ?it/s]
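For readers unfamiliar with BM25, the following is a minimal textbook-style sketch of the scoring function. It is not the `rank_bm25` implementation: the IDF variant used here (`log(1 + (N - n + 0.5) / (n + 0.5))`) and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, so the numbers will not match the library's output exactly.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_tokens: List[str], tokenized_corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in the corpus against the query tokens."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates, and long documents are penalized.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    ["electric", "roadways"],
    ["road", "improvements", "anson", "county"],
    ["electric", "trucks", "tipping", "point"],
]
toy_scores = bm25_scores(["electric", "roadways"], toy_corpus)
print(toy_scores)
```

A document with no query terms scores exactly zero, which is why purely lexical search misses synonyms and spelling variations.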

In [24]:
# Function searches all articles for passages that answer the query
def search(query):
    """Return the top_k articles for `query`, ordered by cross-encoder score."""
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    # Computed for comparison with the semantic results; not used in the returned ranking
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit
    for hit, score in zip(hits, cross_scores):
        hit['cross-score'] = score

    # Sort by cross-encoder score and return the matching article rows.
    # corpus_id is a position in the corpus, so we index by position (.iloc)
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    return articles.iloc[[hit['corpus_id'] for hit in hits]]
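The BM25 step in `search` picks the five highest-scoring documents with `np.argpartition` and then sorts just those five. The same "top-n, best first" selection can be sketched with the standard library, which may be easier to follow. The scores below are made-up values for illustration.

```python
import heapq

# Hypothetical BM25 scores, one per document in the corpus.
scores = [0.0, 7.4, 1.2, 5.9, 7.4, 0.3]
top_n = 3

# nlargest returns the indices of the top_n scores, highest score first.
top_indices = heapq.nlargest(top_n, range(len(scores)), key=scores.__getitem__)
hits = [{"corpus_id": i, "score": scores[i]} for i in top_indices]
print(hits)
```

Selecting the top n this way avoids sorting the whole score array, which matters more as the corpus grows.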


In [25]:
search(query = "electric vehicles")

Input question: electric vehicles


| | article_url | company_name | article_title | article_description | article_body | article_date |
|---|---|---|---|---|---|---|
| 715 | https://newsroom.kiewit.com/news/how-electric-... | Kiewit Infrastructure Co | How Electric Vehicles are Disrupting the Grid ... | | Whether it be cruising down the highway or par... | 2018-07-23 00:00:00 |
| 136 | https://calstart.org/electric-cars-arent-just-... | calstart | Fresno Bee \| Electric cars aren’t just a big-c... | | | |
| 140 | https://calstart.org/tesla-releases-its-electr... | calstart | Scientific American \| Tesla Releases Its Elect... | | | |
| 307 | https://www.stellantis.com/en/news/press-relea... | Fiat Chrysler Automobiles (Stellantis) | Jeep® Brand Reveals Plan to Become the Leading... | | Jeep® brand will introduce four all-electric S... | 2022-09-08 00:00:00 |
| 892 | https://nacfe.org/news/electric-trucks-reach-a... | NACFE (North American Council for Freight Effi... | Electric Trucks Reach a Tipping Point | New analysis from RMI shows that 65% of medium... | New analysis from RMI shows that 65% of medium... | 2023-02-26 00:00:00 |
| 707 | https://newsroom.kiewit.com/news/an-uncertain-... | Kiewit Infrastructure Co | An Uncertain Transition: Merging Transportatio... | | Cars that transmit data to other cars and surr... | 2018-11-13 00:00:00 |
| 967 | https://nacfe.org/news/second-guidance-report-... | NACFE (North American Council for Freight Effi... | Second Guidance Report: Medium-Duty Electric T... | | \n\nThe North American Council for Freight Ef... | |
| 895 | https://nacfe.org/news/more-than-half-of-ameri... | NACFE (North American Council for Freight Effi... | More than Half of American Commercial Vehicles... | More than Half of American Commercial Vehicles... | More than Half of American Commercial Vehicles... | |
| 1757 | https://pressroom.toyota.com/five-highlights-f... | Toyota Motor Engineering & Manufacturing North... | | | ENVIRONMENT\nFive Highlights From Toyota’s Ann... | 2023-04-19 00:00:00 |
| 196 | https://epsenergy.com/ampaire-selects-electric... | Electric Power Systems | Ampaire Selects Electric Power Systems to Supp... | | | 2022-09-19 00:00:00 |


In [35]:
search(query = "hydraulics")

Input question: hydraulics

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	0.475	\n	\n  NCDOT Hydraulics Unit Receives Pelican Award\n\n "​​​From left to right, Andy McDaniel, Stephen Morgan, Ryan Mullins and Matthew Lauffer of the Hydraulics Unit pictured with Lauren Kolodij.\n\n\n​RALEIGH – The N.C. Department of Transportation Hydraulics Unit garnered special recognition over the weekend for its efforts to protect and improve water quality.\n\nThe Unit was named one of three coastwide winners of a 2022 Pelican Award during the N.C. Coastal Federation's annual event held Saturday in Morehead City.\n\nThe Pelican Award honors volunteers, businesses, agencies and organizations that go above and beyond to ensure a healthy North Carolina coast for future generations.\n\nThe Federation commended the group for its dedicated advancement of nature-based resilience initiatives, like the Unit's work on the living shoreline project along N.C. 24. That project is part of NCDOT's 

In [None]:
search(query = "electric roadways")

Input question: electric roadways
Top-3 lexical search (BM25) hits
	7.419	Article: Why Integrated Roadways makes Fast Company’s ‘Next Big Things in Tech’ ", a digital infrastructure technology provider, today announced that it has been named to Fast Company's second annual "
	7.419	Interview: Kansas City's Most Inspiring Stories: Meet Tim Sylvester of Integrated Roadways ", a digital infrastructure technology provider, today announced that it has been named to Fast Company's second annual "
	5.964	\n	\n  Road Improvements Coming to Anson County\n\n "​WADESBORO – Drivers in Anson County will have a smoother ride on more than six miles of roadways, thanks to a $2.4 million contract awarded by the Department of Transportation.\n\nThe contract calls for milling, resurfacing, shoulder reconstruction, and installing pavement markings and snow plowable markers to several roads, including 0.64 miles of U.S. 74 in the vicinity of Morven Freight Line Road and the end of the highway's five-lane s

### **Part 4: Discussion and Results**

The results of our semantic search system have been promising and align closely with our project goals. However, evaluating "relevance" remains inherently subjective and context-dependent. Without direct user interaction or labeled feedback data, the model’s performance cannot be significantly improved through traditional supervised learning. To address this, we are currently exploring the integration of reinforcement learning to iteratively enhance the quality of search results based on user behavior.

We are also in the process of integrating the model into our user interface. Our focus has been on ensuring that only meaningful and contextually appropriate results are displayed—improving both accuracy and presentation to deliver a polished user experience.

Another key feature under development is the ability to apply filters based on relevance score, company name, and publication date. These filters will allow users to refine their search results and more easily locate specific content relevant to their interests.
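As a rough sketch of how such filters might work, the example below filters a list of result rows by minimum relevance score, company name, and publication date. The column names `company_name` and `article_date` come from our DataFrame; the `score` field, the `filter_results` function, and the sample rows are hypothetical and exist only to make the sketch runnable.

```python
from datetime import date
from typing import Dict, List, Optional


def filter_results(
    results: List[Dict],
    min_score: Optional[float] = None,
    company_name: Optional[str] = None,
    published_after: Optional[date] = None,
) -> List[Dict]:
    """Keep only the rows that satisfy every filter that was supplied."""
    kept = []
    for row in results:
        if min_score is not None and row["score"] < min_score:
            continue
        if company_name is not None and row["company_name"] != company_name:
            continue
        if published_after is not None:
            # Articles without a date cannot satisfy a date filter.
            if row["article_date"] is None or row["article_date"] < published_after:
                continue
        kept.append(row)
    return kept


# Made-up rows for illustration; not taken from the ASPIRE corpus.
sample_results = [
    {"article_title": "Article A", "company_name": "Company X", "article_date": date(2023, 2, 26), "score": 0.91},
    {"article_title": "Article B", "company_name": "Company Y", "article_date": None, "score": 0.75},
    {"article_title": "Article C", "company_name": "Company X", "article_date": date(2018, 7, 23), "score": 0.32},
]

recent = filter_results(sample_results, published_after=date(2020, 1, 1))
print([row["article_title"] for row in recent])  # ['Article A']
```

Several of the articles in our corpus have no publication date, so a date filter needs an explicit rule for missing values; this sketch excludes them.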

A notable challenge we encountered was related to database integration. Initially, we used DynamoDB for data storage but later transitioned to Firebase due to compatibility concerns. We are currently experimenting with storing article content in CSV format for deployment, while using Firebase as a backup storage solution. This approach helps mitigate the high costs associated with continuous cloud storage access.

The web application itself, though still in an early visual stage, was built using Flask and Django—both Python-based frameworks. This choice allows for seamless interaction between the model and the front end without relying on JavaScript. Once backend integration is finalized, we plan to deploy the web interface and complete the end-to-end user experience.


## **Next Steps**

Looking ahead, our primary goal is to implement reinforcement learning mechanisms that adapt based on user feedback and interaction. This will allow us to improve the model’s relevance scoring and continue refining the search experience for ASPIRE customers.
