# Implementing Semantic Search

## **1. LangChain Setup**

#### **Loading data from text files and URLs**

In [6]:
import os
#Text file loader
from langchain.document_loaders import TextLoader

In [7]:
print(os.getcwd())

/Users/Aditya/Desktop/Equity_Research


In [22]:
#Test TextLoader on .txt file
loader = TextLoader("nvda_news_1.txt")
data = loader.load()
print(data[0].metadata, '\n')
print(data[0].page_content)

{'source': 'nvda_news_1.txt'} 

The stock of NVIDIA Corp (NASDAQ:NVDA) experienced a daily loss of -3.56% and a 3-month gain of 32.35%. With an Earnings Per Share (EPS) (EPS) of $1.92, the question arises: is the stock significantly overvalued? This article aims to provide a detailed valuation analysis of NVIDIA, offering insights into its financial strength, profitability, growth, and more. We invite you to delve into this comprehensive analysis.

Company Overview

NVDA 30-Year Financial Data

The intrinsic value of NVDA


NVIDIA Corp (NASDAQ:NVDA) is a leading designer of discrete graphics processing units that enhance the experience on computing platforms. The firm's chips are widely used in various end markets, including PC gaming and data centers. In recent years, NVIDIA has broadened its focus from traditional PC graphics applications such as gaming to more complex and favorable opportunities, including artificial intelligence and autonomous driving, leveraging the high-performan

In [24]:
#URL loader
!pip3 install unstructured libmagic python-magic python-magic-bin

Collecting unstructured
  Downloading unstructured-0.14.3-py3-none-any.whl.metadata (31 kB)
Collecting libmagic
  Downloading libmagic-1.0.tar.gz (3.7 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting python-magic
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
ERROR: Ignored the following versions that require a different python version: 0.12.0 Requires-Python >=3.9.0,<3.12; 0.12.2 Requires-Python >=3.9.0,<3.12; 0.12.3 Requires-Python >=3.9.0,<3.12; 0.12.4 Requires-Python >=3.9.0,<3.12; 0.12.5 Requires-Python >=3.9.0,<3.12; 0.12.6 Requires-Python >=3.9.0,<3.12; 0.13.0 Requires-Python <3.12,>=3.9.0; 0.13.1 Requires-Python <3.12,>=3.9.0; 0.13.2 Requires-Python <3.12,>=3.9.0; 0.13.3 Requires-Python <3.12,>=3.9.0; 0.13.4 Requires-Python <3.12,>=3.9.0; 0.13.5 Requires-Pyth

In [26]:
from langchain.document_loaders import UnstructuredURLLoader

In [38]:
loader = UnstructuredURLLoader(urls=[
    "https://www.bloomberg.com/news/articles/2024-05-31/nordstrom-jwn-loyalty-memberships-drive-deeper-quarterly-loss?srnd=industries-v2",
    "https://www.bloomberg.com/news/articles/2024-05-30/foot-locker-turnaround-shows-signs-of-life-as-sales-stabilize?srnd=industries-v2"
])

In [42]:
data = loader.load()
len(data)
data[0]

Document(page_content="Bloomberg\n\nNeed help? Contact us\n\nWe've detected unusual activity from your computer network\n\nTo continue, please click the box below to let us know you're not a robot.\n\nWhy did this happen?\n\nPlease make sure your browser supports JavaScript and cookies and that you are not\n            blocking them from loading.\n            For more information you can review our Terms of\n                Service and Cookie Policy.\n\nNeed Help?\n\nFor inquiries related to this message please contact\n            our support team and provide the reference ID below.\n\nBlock reference ID:", metadata={'source': 'https://www.bloomberg.com/news/articles/2024-05-31/nordstrom-jwn-loyalty-memberships-drive-deeper-quarterly-loss?srnd=industries-v2'})
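
The output above is Bloomberg's bot-detection page, not the article itself, so indexing it would add noise to the vector store. A minimal sketch of a sanity check that could run before splitting and embedding is shown here. The helper names `looks_blocked` and `keep_usable`, the marker phrases, and the sample pages are all hypothetical; the phrases are copied from the blocked response above and would need extending for other sites.

```python
# Phrases taken from the blocked response shown above. This list is an
# assumption and is not exhaustive.
BLOCK_MARKERS = (
    "detected unusual activity",
    "you're not a robot",
)


def looks_blocked(page_text):
    """Return True if the text contains a known bot-detection phrase."""
    lowered = page_text.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def keep_usable(pages):
    """Keep only (source, text) pairs whose text does not look like a block page."""
    return [(source, text) for source, text in pages if not looks_blocked(text)]


# Made-up sample pages, for illustration only.
pages = [
    ("https://example.com/a", "We've detected unusual activity from your computer network"),
    ("https://example.com/b", "Quarterly sales stabilized as the turnaround continued."),
]
usable = keep_usable(pages)
print([source for source, _ in usable])  # prints ['https://example.com/b']
```

With LangChain documents, the same check could be applied to each `doc.page_content` returned by `loader.load()`.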

## **2. Text Splitting**

#### **Splitting articles into smaller chunks to stay within token limits**

In [50]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=0
)

#Test splitter
text = """Nvidia blasted higher after smashing targets on May 22. Shares shot past the 1,000 mark as the stock flew right out of buy range. At the same time, Google stock continues its return to glory, clearing is most recent entry, a 153.78 buy point. Like Nvidia, Alphabet trades right around a record high.Like Nvidia and Alphabet, fellow Magnificent Seven stock Microsoft remains within striking distance of its own record high.

Meanwhile, Tesla stock gapped up after reporting earnings on April 23. But the EV maker has given back much of those gains, although it remains above its 50-day line.In addition to monitoring IBD's recommended market exposure level, regularly checking names on lists like the IBD 50, IBD Long-Term Leaders and IBD Sector Leaders provides a good starting point for finding and evaluating potential stock picks. Each day, you can also see which stocks just came on or off these lists. """

chunks = splitter.split_text(text)
len(chunks)
chunks

Created a chunk of size 422, which is longer than the specified 200


['Nvidia blasted higher after smashing targets on May 22. Shares shot past the 1,000 mark as the stock flew right out of buy range. At the same time, Google stock continues its return to glory, clearing is most recent entry, a 153.78 buy point. Like Nvidia, Alphabet trades right around a record high.Like Nvidia and Alphabet, fellow Magnificent Seven stock Microsoft remains within striking distance of its own record high.',
 "Meanwhile, Tesla stock gapped up after reporting earnings on April 23. But the EV maker has given back much of those gains, although it remains above its 50-day line.In addition to monitoring IBD's recommended market exposure level, regularly checking names on lists like the IBD 50, IBD Long-Term Leaders and IBD Sector Leaders provides a good starting point for finding and evaluating potential stock picks. Each day, you can also see which stocks just came on or off these lists."]

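The warning above appears because a single-separator splitter has nothing finer than `"\n"` to fall back on: a paragraph with no newline inside it stays whole, however long it is. The sketch below reproduces that behavior in plain Python. It is a simplified illustration, not LangChain's implementation, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # The piece may exceed chunk_size: there is no finer separator to use.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = ("A" * 30) + "\n" + ("B" * 5) + "\n" + ("C" * 4)
chunks = split_on_separator(sample, "\n", 10)
print([len(c) for c in chunks])  # prints [30, 10]
```

The 30-character piece is emitted as one oversized chunk, while the two short pieces are merged into a chunk of exactly 10 characters.
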
In [57]:
#Use Recursive Splitter to allow multiple separators
from langchain.text_splitter import RecursiveCharacterTextSplitter
rsplitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " "],
    chunk_size=200,
    chunk_overlap=0
)

chunks = rsplitter.split_text(text)
chunks

['Nvidia blasted higher after smashing targets on May 22. Shares shot past the 1,000 mark as the stock flew right out of buy range. At the same time, Google stock continues its return to glory, clearing',
 'is most recent entry, a 153.78 buy point. Like Nvidia, Alphabet trades right around a record high.Like Nvidia and Alphabet, fellow Magnificent Seven stock Microsoft remains within striking distance',
 'of its own record high.',
 "Meanwhile, Tesla stock gapped up after reporting earnings on April 23. But the EV maker has given back much of those gains, although it remains above its 50-day line.In addition to monitoring IBD's",
 'recommended market exposure level, regularly checking names on lists like the IBD 50, IBD Long-Term Leaders and IBD Sector Leaders provides a good starting point for finding and evaluating potential',
 'stock picks. Each day, you can also see which stocks just came on or off these lists.']
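
Every chunk above now fits within 200 characters because the recursive splitter falls back to the next separator whenever a piece is still too long. A rough sketch of that idea in plain Python follows. It is a simplification under stated assumptions, not LangChain's algorithm (for example, it does not merge small pieces across recursion levels), and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size, trying separators in order."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo = recursive_split("aaa bbb ccc\n\nddd eee fff ggg", ["\n\n", "\n", " "], 7)
print(demo)  # prints ['aaa bbb', 'ccc', 'ddd eee', 'fff ggg']
```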

## **3. Vector Database**

#### **Convert chunks of text into embeddings**

In [62]:
!pip3 install faiss-cpu
!pip3 install sentence-transformers

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (3.6 kB)
Downloading faiss_cpu-1.8.0-cp312-cp312-macosx_11_0_arm64.whl (3.1 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0
Collecting sentence-transformers
  Downloading sentence_transformers-3.0.0-py3-none-any.whl.metadata (10 kB)
Collecting transformers<5.0.0,>=4.34.0 (from sentence-transformers)
  Downloading transformers-4.41.2-py3-none-any.whl.metadata (43 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Downloading torch-2.3.0-cp312-none-macosx_11_0_arm64.whl.metadata (26 kB)
Collecting scikit-learn (from sentence-transformers)
  Downloading scikit_learn-1.5.0-cp312-cp312-macosx_12_0_

In [64]:
import pandas as pd

pd.set_option('display.max_colwidth', 100)
df = pd.read_csv("sample_text.csv")
df

| Unnamed: 0 | text | category |
|---|---|---|
| 0 | Meditation and yoga can improve mental health | Health |
| 1 | Fruits, whole grains and vegetables helps control blood pressure | Health |
| 2 | These are the latest fashion trends for this week | Fashion |
| 3 | Vibrant color jeans for male are becoming a trend | Fashion |
| 4 | The concert starts at 7 PM tonight | Event |
| 5 | Navaratri dandiya program at Expo center in Mumbai this october | Event |
| 6 | Exciting vacation destinations for your next trip | Travel |
| 7 | Maldives and Srilanka are gaining popularity in terms of low budget vacation places | Travel |


In [67]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")
vectors = encoder.encode(df["text"].tolist())  # pass a plain list of strings
vectors.shape
dim = vectors.shape[1]
dim

768

#### **Store embeddings in FAISS (Facebook AI Similarity Search) Index**

In [69]:
import faiss
index = faiss.IndexFlatL2(dim)
index
index.add(vectors)

In [71]:
import numpy as np

search_query = "I want to buy a polo t-shirt"
vec = encoder.encode(search_query)
vec.shape  # (768,)
search_vector = np.array(vec).reshape(1, -1)  # FAISS expects a 2D array: (n_queries, dim)
search_vector.shape

(1, 768)

In [73]:
distances, I = index.search(search_vector, k = 2)
I

array([[3, 2]])

In [76]:
df.loc[I[0]]

| Unnamed: 0 | text | category |
|---|---|---|
| 3 | Vibrant color jeans for male are becoming a trend | Fashion |
| 2 | These are the latest fashion trends for this week | Fashion |
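
The query never mentions "fashion", yet both hits are Fashion rows because the search compares embedding vectors rather than keywords. `IndexFlatL2` does this by exhaustive search: it measures the squared L2 distance from the query to every stored vector and returns the `k` closest. The sketch below mirrors that logic in plain Python on made-up 2D vectors. `flat_l2_search` is a hypothetical helper, not part of the FAISS API.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(stored, query, k):
    """Brute-force nearest-neighbour search; returns (distances, indices)."""
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(stored))
    nearest = heapq.nsmallest(k, scored)  # k smallest distances, ascending
    distances = [d for d, _ in nearest]
    indices = [i for _, i in nearest]
    return distances, indices


# Toy vectors for illustration; real embeddings here have 768 dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(stored, [0.9, 0.1], k=2)
print(indices)  # prints [1, 0]
```

As in the notebook, the returned indices are row positions, so they can be used to look the matching rows up in the original DataFrame.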


## **4. OpenAI Integration**

#### **Initialize LLM**

In [82]:
import os
import streamlit as st
import pickle
import time
import langchain
from langchain.llms import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains.qa_with_sources.loading import load_qa_with_sources_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

In [83]:
os.environ['OPENAI_API_KEY'] = 'your_openai_api_key_here'

In [96]:
llm = OpenAI(temperature=0.9, max_tokens=500)
loader = UnstructuredURLLoader(urls=[
    "https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitions_by_Apple",
    "https://en.wikipedia.org/wiki/Tesla,_Inc."
])
data = loader.load()
len(data)

2

In [107]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

docs = splitter.split_documents(data)
len(docs)
print(docs[38].page_content)

^ "Apple acquires digital magazine startup Prss". The Next Web. September 23, 2014. Retrieved September 24, 2014.

^ "Apple Acquired Prss in 2014 and today their Invention that became 'Apple News' was Published by USPTO". Patently Apple. Retrieved November 23, 2017.

^ "Apple Quietly Bought Dryft, A Keyboard App". TechCrunch. April 8, 2015.

^ "Apple has acquired London music production software company Camel Audio". Business Insider. February 24, 2015.

^ "Apple buys the British startup behind music analytics service Musicmetric". The Wall Street Journal. January 21, 2015.

^ "Apple buys UK music start-up Semetric". The Telegraph.

^ "Apple Acquires Durable Database Company FoundationDB". TechCrunch. March 24, 2015.

^ "Apple Buys Israeli Camera-Technology Company LinX". The Wall Street Journal. April 14, 2015.

^ "Apple bought a company focused on super-accurate GPS". Engadget. May 17, 2015.

^ "Apple Acquires Augmented Reality Company Metaio". TechCrunch. May 28, 2015.


In [120]:
embeddings = OpenAIEmbeddings()
vectorindex_openai = FAISS.from_documents(docs, embeddings)

In [121]:
#Store the vector index locally. save_local writes both the FAISS index and the
#docstore, so nothing has to be re-embedded when the index is loaded again.
index_dir = "vector_index"
vectorindex_openai.save_local(index_dir)

In [124]:
if os.path.exists(index_dir):
    #Recent LangChain releases require allow_dangerous_deserialization because the
    #docstore is pickled; only enable it for index files you created yourself.
    vectorindex_openai = FAISS.load_local(
        index_dir, embeddings, allow_dangerous_deserialization=True
    )
else:
    print(f"Index directory not found: {index_dir}")

In [129]:
retriever = vectorindex_openai.as_retriever()
chain = RetrievalQAWithSourcesChain.from_llm(llm=llm, retriever=retriever)
chain



In [130]:
query = "What is Apple's business philosophy?"
langchain.debug=True
chain.invoke({"question": query}, return_only_outputs=True)

[chain/start] [chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
{
  "question": "What is Apple's business philosophy?"
}
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:MapReduceDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:MapReduceDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "input_list": [
    {
      "context": "equity stakes in three preexisting companies, and has made three\n\ndivestments. Apple has not released the financial details for the majority of its mergers and acquisitions.\n\nApple's business philosophy is to acquire small companies that can be easily integrated into existing company projects.[6] For instance, Apple acquired Emagic and its professional music software, Logic Pro, in 2002. The acquisition was incorporated in the creation of the digital audio workstation software Garag

{'answer': " Apple's business philosophy is to consistently acquire small companies to integrate into existing projects.\n",
 'sources': 'https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitions_by_Apple'}