<a href="https://colab.research.google.com/github/athapa785/LLM_4_Biz_Stanford/blob/main/aditya_thapa_llm4biz_homework_3_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aditya Thapa

SLAC National Accelerator Laboratory | Stanford University

# Homework 3: Retrieval Augmented Generation (RAG) with LlamaIndex
## LLM for Biz with Python

This project uses the LlamaIndex library to build a Retrieval Augmented Generation (RAG) system for querying information from several sources. It shows how to load a PDF document, create a chatbot interface, store embeddings in a vector database, and augment responses with external knowledge sources such as arXiv papers and YouTube transcripts.

**Key functionalities implemented**:

**Chatbot Interaction:**
- Creates a chatbot using `as_chat_engine` for interactive querying.
- Allows users to ask questions and receive relevant responses.
- Maintains chat history for context.

**Vector Databases:**
- Integrates with ChromaDB to store and retrieve vector embeddings of the documents in a persistent collection.
- Uses `ChromaVectorStore` to interact with the vector database.

**Knowledge Augmentation:**
- Loads arXiv papers in the high-energy astrophysics category (`astro-ph.HE`) and answers questions about them.
- Uses YouTube transcripts to answer questions about James Webb Space Telescope (JWST) discoveries.
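
The retrieve-then-generate flow behind these features can be sketched in plain Python. This is a toy illustration, not how LlamaIndex works internally: the word-count "embedding", the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`), and the three sample chunks are all assumptions made for the example. A real system would use a learned embedding model and send the prompt to an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]


def build_prompt(query, context):
    """Assemble the augmented prompt that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer using only the context above."


# Hypothetical chunks, paraphrased for illustration only.
chunks = [
    "NOAA bought commercial satellite data to improve weather forecasts.",
    "BEA publishes annual statistics on the U.S. space economy.",
    "USPTO held IP seminars for the commercial space sector.",
]
question = "How is the space economy measured?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Only the retrieved chunk reaches the prompt, which is why the quality of retrieval (chunking, embeddings, and the vector store) matters as much as the model that generates the answer.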

In [1]:
!wget https://www.commerce.gov/sites/default/files/2025-01/2021-2024-Space-Accomplishments.pdf

--2025-02-18 19:57:30--  https://www.commerce.gov/sites/default/files/2025-01/2021-2024-Space-Accomplishments.pdf
Resolving www.commerce.gov (www.commerce.gov)... 172.65.90.24, 172.65.90.27, 172.65.90.26, ...
Connecting to www.commerce.gov (www.commerce.gov)|172.65.90.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2184033 (2.1M) [application/pdf]
Saving to: ‘2021-2024-Space-Accomplishments.pdf’


2025-02-18 19:57:53 (119 KB/s) - ‘2021-2024-Space-Accomplishments.pdf’ saved [2184033/2184033]



In [2]:
%%capture
!pip install llama-index --upgrade

In [3]:
%%capture
!pip install pypdf

In [4]:
# Initialize key and client

from openai import OpenAI
from google.colab import userdata

open_ai_key = userdata.get('open_ai_key')

client = OpenAI(api_key=open_ai_key)

In [5]:
import os
os.environ["OPENAI_API_KEY"] = open_ai_key

In [6]:
from IPython.display import Markdown

In [7]:
# Import necessary classes from the llama_index package
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

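# Load only the downloaded PDF. load_data() does not take a file name, so the
# file is selected with input_files; show_progress=True prints the progress bar.
documents = SimpleDirectoryReader(
    input_files=["2021-2024-Space-Accomplishments.pdf"]
).load_data(show_progress=True)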

# Create a VectorStoreIndex object from the documents. This will involve processing the documents
# and creating a vector representation for each of them, suitable for semantic searching.
index = VectorStoreIndex.from_documents(documents)

Loading files: 100%|██████████| 1/1 [00:02<00:00,  2.06s/file]


# Make a chatbot

In [8]:
chat_engine = index.as_chat_engine()
response = chat_engine.chat("What are some of the recent space accomplishments?")
Markdown(response.response)

Some recent space accomplishments include leveraging commercial space capabilities for weather observation, fostering diversity and opportunity in the space industry, measuring the U.S. space economy, and supporting space-related intellectual property. These achievements involve initiatives such as improving weather forecasts through commercial satellite data buys, promoting diversity and inclusion in the space industry, quantifying the U.S. space economy's contributions, and supporting commercial space innovation through intellectual property initiatives.

In [9]:
response = chat_engine.chat("Are there any updates on weather observations?")
Markdown(response.response)

NOAA has made various commercial satellite data buys to enhance its weather forecasts, including placing seven data orders for radio occultation satellite data. Additionally, NOAA bought commercial satellite data for space weather, ocean surface winds, and microwave sounding evaluations.

In [10]:
for chat in chat_engine.chat_history:
  print(chat)

user: What are some of the recent space accomplishments?
assistant: None
tool: Recent space accomplishments include leveraging commercial space capabilities for weather observation, fostering diversity and opportunity in the space industry, measuring the U.S. space economy, and supporting space-related intellectual property. These achievements involve initiatives such as improving weather forecasts through commercial satellite data buys, promoting diversity and inclusion in the space industry, quantifying the U.S. space economy's contributions, and supporting commercial space innovation through intellectual property initiatives.
assistant: Some recent space accomplishments include leveraging commercial space capabilities for weather observation, fostering diversity and opportunity in the space industry, measuring the U.S. space economy, and supporting space-related intellectual property. These achievements involve initiatives such as improving weather forecasts through commercial satel
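
The history printed above is a sequence of role-tagged messages (`user`, `assistant`, `tool`). A minimal sketch of that structure follows. The `Message` and `ChatHistory` classes are hypothetical stand-ins written for this example, not the LlamaIndex classes, and the `preview` helper only shortens long messages for display.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


@dataclass
class ChatHistory:
    """Hypothetical container for a conversation; not a LlamaIndex class."""
    messages: list = field(default_factory=list)

    def add(self, role, content):
        self.messages.append(Message(role, content))

    def preview(self, width=60):
        """Return one line per message, truncating content to `width` characters."""
        lines = []
        for m in self.messages:
            text = m.content
            if len(text) > width:
                text = text[: width - 3] + "..."
            lines.append(f"{m.role}: {text}")
        return lines


history = ChatHistory()
history.add("user", "What are some of the recent space accomplishments?")
history.add(
    "assistant",
    "Some recent space accomplishments include leveraging commercial "
    "space capabilities for weather observation.",
)
for line in history.preview():
    print(line)
```

Because every turn is appended to the same list, a follow-up question can be answered with the earlier turns as context.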

# Let's do this with proper chunking

In [11]:
from llama_index.core import Settings

Settings.chunk_size = 5000
Settings.chunk_overlap = 500
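
To make the two settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is an assumption-laden illustration: the function `chunk_with_overlap` is hypothetical, it splits a plain list of tokens, and it ignores sentence boundaries, which the library's own splitter may handle differently.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Small numbers so the overlap is easy to see.
tokens = [f"t{i}" for i in range(12)]
chunks = chunk_with_overlap(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=5000` and `chunk_overlap=500`, each window would advance by 4500 tokens, so the last 500 tokens of one chunk reappear at the start of the next. The overlap reduces the chance that a fact is cut in half at a chunk boundary.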

In [12]:
# Rebuild the index so that the new chunk_size and chunk_overlap settings
# are applied when the documents are split and embedded.
index = VectorStoreIndex.from_documents(documents)

In [13]:
chat_engine = index.as_chat_engine()

In [14]:
response = chat_engine.chat("Tell me about the most recent accomplishments in the space sector")
Markdown(response.response)

The most recent accomplishments in the space sector include leveraging commercial space capabilities for weather observation, fostering diversity and opportunity in the space industry, measuring the U.S. space economy, and supporting space-related intellectual property. These achievements highlight advancements in weather forecasting accuracy, efforts to promote diversity and inclusion in the space workforce, quantifying the contributions of the space economy, and supporting innovation through intellectual property initiatives.

In [15]:
response = chat_engine.chat("Can you expand on that?")
Markdown(response.response)

Here are more details on the recent accomplishments in the space sector:

1. **Commercial Space Capabilities for Weather Observation**:
   - NOAA utilized commercial satellite data to enhance weather forecasts and support the development of commercial markets. This included ordering radio occultation satellite data to improve forecasting accuracy and effectiveness, as well as purchasing commercial satellite data for various requirements such as space weather, ocean surface winds, and microwave sounding.

2. **Diversity and Opportunity in the Space Industry**:
   - Initiatives were undertaken to broaden participation in the space industry workforce and supplier base. Efforts included collaborating with organizations focused on increasing diversity, equity, and inclusion, hosting events celebrating women and African American contributions to space, supporting fellowship programs, and forming partnerships to connect minority business enterprises to NASA opportunities.

3. **Measuring the U.S. Space Economy**:
   - The Bureau of Economic Analysis (BEA) quantified the U.S. space economy by publishing annual statistics that measured its contributions to GDP, employment, and other key measures. This data provides insights to inform decision makers in government and industry.

4. **Supporting Space-Related Intellectual Property**:
   - The U.S. Patent and Trademark Office (USPTO) supported commercial space innovation through stakeholder initiatives aimed at reducing barriers to the intellectual property landscape. This included working groups on accelerating commercial space innovation, IP seminars at the 2023 Paris Airshow, and an international dialogue focused on the intersection of IP and the expanding commercial space sector.

In [16]:
response = chat_engine.chat("Tell me more about international space business partnerships.")
Markdown(response.response)

International space business partnerships involve the Department of Commerce organizing and leading international commercial space dialogues with multiple nations to promote business partnerships and strengthen diplomatic ties. The nations engaged in these dialogues include Australia, Canada, France, Germany, India, Japan, New Zealand, Philippines, Republic of Korea, Singapore, as well as nations from the African Union. Additionally, the International Trade Administration (ITA) promotes U.S. aerospace trade interests and manages active space-related cases to support contract wins worth billions of dollars, which in turn support thousands of U.S. jobs.

# VectorDB

In [17]:
%%capture
!pip install llama-index-vector-stores-weaviate

In [18]:
%%capture
!pip install llama-index-vector-stores-chroma

In [19]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

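# PersistentClient keeps the collection on disk between sessions.
# get_or_create_collection avoids an error if this cell is re-run.
chroma_client = chromadb.PersistentClient()
chroma_collection = chroma_client.get_or_create_collection("tech16example")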
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [20]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

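# Select the PDF explicitly; load_data() does not accept a file name.
documents = SimpleDirectoryReader(
    input_files=["2021-2024-Space-Accomplishments.pdf"]
).load_data(show_progress=True)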
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Loading files: 100%|██████████| 1/1 [00:04<00:00,  4.99s/file]


In [21]:
chat_engine = index.as_chat_engine()
response = chat_engine.chat("Tell me about US Space economy.")
Markdown(response.response)

The US space economy is quantified by the Bureau of Economic Analysis (BEA) in terms of its contributions to GDP, employment, and other key measures. The US Patent and Trademark Office (USPTO) supports commercial space innovation through initiatives aimed at reducing barriers to the intellectual property landscape. The Bureau of Industry and Security (BIS) conducts assessments to provide insights into the health of the US space supply chain.

# YouTube Transcript Loader

In [22]:
%%capture
!pip install llama-hub-youtube-transcript
!pip install llama-index-readers-youtube-transcript

In [23]:
from llama_index.readers.youtube_transcript import YoutubeTranscriptReader

loader = YoutubeTranscriptReader()
documents = loader.load_data(
    ytlinks=["https://www.youtube.com/watch?v=1Ul2tR7qxqM"]
)
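
Transcripts are keyed by video ID rather than by full URL, so it can help to check which ID a link carries before loading it. The helper below is a hypothetical sketch using only the standard library. It is not part of `YoutubeTranscriptReader`, and it handles only the `watch?v=` and `youtu.be/` link forms.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def youtube_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube link, or None if it is not recognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None

    if host in YOUTUBE_HOSTS:
        # Standard links carry the ID in the "v" query parameter.
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None

    return None


video_id = youtube_video_id("https://www.youtube.com/watch?v=1Ul2tR7qxqM")
print(video_id)
```

Checking the ID up front makes it easier to spot a mistyped or unsupported link before any transcript request is made.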

In [24]:
# Build an in-memory vector index over the transcript.
index = VectorStoreIndex.from_documents(documents)

# Wrap the index in a chat engine. It retrieves the transcript chunks most
# relevant to each question and keeps the conversation history for follow-ups.
chat_engine = index.as_chat_engine()

# Ask about the video's content; the answer is based on the retrieved chunks.
response = chat_engine.chat("What has JWST discovered so far?")

# Render the answer as Markdown.
Markdown(response.response)

The James Webb Space Telescope has made several discoveries, including a hungry black hole in a dwarf galaxy, intricate details of galaxies interacting with each other, a giant asteroid emitting jets, a galaxy with gas brighter than its stars, a pair of galaxies crashing into each other, star-forming areas at the edge of the Milky Way, and many more fascinating findings.

In [25]:
response = chat_engine.chat("Give me details of the most recent discoveries.")
Markdown(response.response)

The most recent discoveries by the James Webb Space Telescope include capturing a detailed image of a mini Neptune exoplanet with a highly reflective atmosphere likely composed of water vapor, and providing a clear view of Neptune's rings and moons.

In [26]:
response = chat_engine.chat("Please include more details to your previous response.")
Markdown(response.response)

The most recent discoveries by the James Webb Space Telescope include capturing a detailed image of a mini Neptune exoplanet located 40 light years from Earth. The planet has a highly reflective atmosphere likely composed of water vapor and a thick layer of clouds or haze. Additionally, JWST provided a clear view of Neptune's rings, showcasing the planet's narrow rings, fainter dust bands, and revealing seven of Neptune's 14 known moons.

# arXiv Reader
Loading High Energy Astrophysics papers from arXiv into the Vector DB.

In [27]:
%%capture
!pip install llama-index-readers-papers

In [28]:
from llama_index.readers.papers import ArxivReader

loader = ArxivReader()
documents = loader.load_data(search_query="cat:astro-ph.HE") # Category: High energy astrophysics
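
The `search_query` string follows the arXiv API query syntax, where `cat:` restricts results to a subject category. As a rough sketch of what such a query looks like at the HTTP level, the snippet below builds a request URL for the public arXiv API without sending it. The helper `arxiv_query_url` is hypothetical, and whether `ArxivReader` issues the request in exactly this form is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ARXIV_API = "http://export.arxiv.org/api/query"


def arxiv_query_url(search_query, start=0, max_results=10):
    """Build an arXiv API URL, newest submissions first (no request is sent)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = arxiv_query_url("cat:astro-ph.HE", max_results=5)
print(url)

# Round-trip check: decoding the URL gives back the original query string.
decoded = parse_qs(urlparse(url).query)
print(decoded["search_query"])
```

Sorting by submission date is a choice made for this sketch. It matters when asking about the "latest" work, because the answers can only reflect the papers that were actually retrieved.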

In [29]:
chroma_client = chromadb.PersistentClient()
# get_or_create_collection avoids an error if this cell is re-run.
chroma_collection = chroma_client.get_or_create_collection("astro_db")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [30]:
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

In [31]:
chat_engine = index.as_chat_engine()
response = chat_engine.chat(
    "What are some of the latest breakthroughs in high energy astrophysics?"
    )
Markdown(response.response)

Recent advancements in high-energy astrophysics include studies on high-energy neutrinos originating from astrophysical sources like gamma-ray bursts (GRBs). Researchers have been investigating the detection and analysis of high-energy neutrinos to understand the processes occurring in these extreme cosmic events. Additionally, there have been developments in the observation of high-energy gamma-ray photons produced in various astrophysical phenomena, shedding light on the high-energy processes in the universe. These breakthroughs contribute to our understanding of the most energetic events in the cosmos and the fundamental physics governing them.

In [32]:
response = chat_engine.chat(
    "Can you elaborate on that?"
    )
Markdown(response.response)

In high-energy astrophysics, researchers have made significant progress in studying high-energy neutrinos that originate from astrophysical sources such as gamma-ray bursts (GRBs). Neutrinos are elusive subatomic particles that can provide valuable information about the extreme processes taking place in the universe. By detecting and analyzing high-energy neutrinos, scientists can gain insights into the mechanisms and conditions present in these cosmic events.

Gamma-ray bursts are among the most energetic phenomena in the universe, releasing intense bursts of gamma-ray radiation. Studying the high-energy neutrinos associated with GRBs can help researchers understand the underlying physics of these explosive events. Detecting neutrinos from GRBs can provide clues about the acceleration mechanisms, particle interactions, and energy transfer processes involved in these high-energy astrophysical phenomena.

Furthermore, advancements in observing high-energy gamma-ray photons have also contributed to our understanding of astrophysical processes. Gamma-ray photons are emitted during various cosmic events, including supernova explosions, active galactic nuclei, and gamma-ray bursts. By studying these high-energy photons, scientists can investigate the sources of gamma-ray emission, the properties of the surrounding environments, and the dynamics of the high-energy processes occurring in the universe.

Overall, the recent breakthroughs in high-energy astrophysics, particularly in the detection and analysis of high-energy neutrinos and gamma-ray photons, have provided valuable insights into the most energetic events in the cosmos. These advancements enhance our understanding of the fundamental physics governing the universe's extreme phenomena and contribute to unraveling the mysteries of high-energy astrophysical processes.