<a href="https://colab.research.google.com/github/dhorvath/AI-Stuff/blob/main/Simple%20RAG%20and%20exploring%20llamaindex%20data%20loaders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [25]:
# Install all the things
!pip install openai
!pip install sentence-transformers
!pip install langchain pypdf langchain-openai #tiktoken chromadb
!pip install llama-index --upgrade
!pip install PyPDF2
!pip install textwrap3



In [26]:
# Imports
from openai import OpenAI
from google.colab import userdata
from google.colab import files
import os
import pypdf

# API
open_ai_key = userdata.get('open_ai_key')
client = OpenAI(api_key=open_ai_key)

os.environ["OPENAI_API_KEY"] = open_ai_key

In [29]:
# Helper
import textwrap3
def wrap_print(long_text):
  print('\n'.join(textwrap3.wrap(long_text)))
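
`wrap_print` wraps the whole string as a single paragraph, so any blank lines in a long answer are collapsed into spaces. A minimal sketch of a variant that wraps each paragraph separately, assuming the standard-library `textwrap` module is acceptable in place of `textwrap3`; the name `wrap_paragraphs` is hypothetical:

```python
import textwrap

def wrap_paragraphs(long_text, width=70):
    """Wrap each paragraph on its own so blank lines between paragraphs survive."""
    paragraphs = long_text.split("\n\n")
    wrapped = [textwrap.fill(p, width=width) for p in paragraphs]
    return "\n\n".join(wrapped)

sample = "First paragraph of a long answer. " * 3 + "\n\n" + "Second paragraph."
print(wrap_paragraphs(sample))
```

Returning the string instead of printing it keeps the helper easy to reuse with `display(Markdown(...))`.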

# Load data

In [3]:
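def upload_pdfs():
    """Open Colab's upload dialog and return the names of the uploaded .pdf files."""
    # files.upload() saves each file to the working directory and returns {filename: bytes}.
    uploaded = files.upload()
    pdf_files = [f for f in uploaded.keys() if f.endswith('.pdf')]
    return pdf_files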

In [50]:
print("Please upload PDF documents:")
pdf_files = upload_pdfs()

Please upload PDF documents:


Saving san_francisco-ca-2 (1).pdf to san_francisco-ca-2 (1).pdf


In [51]:
!ls

 chroma   drive   sample_data  'san_francisco-ca-2 (1).pdf'


# Index data and create query engine


In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load every PDF in the current directory. required_exts restricts the reader to .pdf files;
# load_data() does not take a filename pattern (its first argument is show_progress),
# so the filter has to be set on the reader itself.
documents = SimpleDirectoryReader("./", required_exts=[".pdf"]).load_data(show_progress=True)

# Build a VectorStoreIndex: the documents are split into chunks, each chunk is embedded,
# and the resulting vectors are stored for similarity search.
index = VectorStoreIndex.from_documents(documents)

# Wrap the index in a query engine, which retrieves the chunks most similar to a question
# and passes them to the LLM to write an answer.
query_engine = index.as_query_engine()

Loading files:   0%|          | 0/1 [00:00<?, ?file/s]
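
The retrieval step behind a vector index can be sketched without any external service. This is a toy illustration only, not how `VectorStoreIndex` is implemented: it assumes a bag-of-words count vector in place of a learned embedding model, and the three sample chunks are made-up stand-ins rather than text from the charter.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (a real index uses a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    """Score every chunk against the query and return the k best (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Hypothetical sample chunks, for illustration only.
chunks = [
    "The Small Business Commission oversees the Office of Small Business.",
    "The Historic Preservation Commission advises on historic resources.",
    "Incumbent officers keep their status during the transition period.",
]
for score, chunk in top_k("who oversees small business", chunks, k=1):
    print(round(score, 2), chunk)
```

A query engine adds one more step on top of this: the top-scoring chunks are placed in the prompt so the LLM can answer from them.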

# Run a query

In [32]:
# Retrieve the chunks most relevant to the question and have the LLM answer from them.
response = query_engine.query("What might most San Francisco residents find most interesting about the city charter? Be concise.")

# Print the synthesized answer, wrapped for readability.
wrap_print(str(response))

Most San Francisco residents might find the provisions related to the
Small Business Commission and the Historic Preservation Commission in
the city charter most interesting.
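
To check which passages an answer was built from, the response's `source_nodes` can be inspected. A minimal sketch, assuming each entry exposes `score` and `node` (with `metadata` and `get_content()`), and assuming the PDF reader stored a `page_label` metadata key. The helper name `describe_sources` and the stand-in response object are hypothetical, so the code runs without an API call:

```python
from types import SimpleNamespace

def describe_sources(response, preview_chars=80):
    """Return one line per retrieved chunk: similarity score, page label, text preview."""
    lines = []
    for hit in response.source_nodes:
        page = hit.node.metadata.get("page_label", "?")
        preview = hit.node.get_content()[:preview_chars].replace("\n", " ")
        score = "n/a" if hit.score is None else f"{hit.score:.3f}"
        lines.append(f"score={score} page={page} {preview}")
    return lines

# Stand-in shaped like a query response (placeholder values, not real output).
fake_hit = SimpleNamespace(
    score=0.8123,
    node=SimpleNamespace(
        metadata={"page_label": "12"},
        get_content=lambda: "Example chunk text (placeholder, not from the charter)",
    ),
)
fake_response = SimpleNamespace(source_nodes=[fake_hit])
print("\n".join(describe_sources(fake_response)))
```

In the notebook, the same helper could be called as `describe_sources(response)` right after `query_engine.query(...)`.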


# Adding Memory / Asking Questions

In [34]:
chat_engine = index.as_chat_engine(chat_mode="openai", verbose=True)
response = chat_engine.chat("What might most San Francisco residents find most interesting about the city charter? Be concise.")
wrap_print(str(response))

Added user message to memory: What might most San Francisco residents find most interesting about the city charter? Be concise.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"Most interesting aspects of the San Francisco city charter for residents"}
Got output: Residents of San Francisco may find the provisions related to the Small Business Commission and the Historic Preservation Commission to be particularly interesting. The Small Business Commission oversees the San Francisco Office of Small Business and consists of members appointed by the Mayor and the Board of Supervisors, reflecting the diversity of neighborhood and small business interests in the city. On the other hand, the Historic Preservation Commission advises the city on historic preservation matters and participates in processes involving historic or cultural resources, showcasing the city's commitment to preserving its heritage.

Residents of San Francisco may find the provisions relat

In [35]:
response = chat_engine.chat("Can you expand on that?")
wrap_print(str(response))

Added user message to memory: Can you expand on that?
The Small Business Commission in San Francisco plays a crucial role in
supporting and advocating for small businesses in the city. It
provides a platform for small business owners to voice their concerns
and needs, influencing policies and initiatives that impact the local
business community. The commission's diverse membership ensures that a
wide range of perspectives and interests are represented, contributing
to a more inclusive and equitable decision-making process.  On the
other hand, the Historic Preservation Commission is responsible for
safeguarding and promoting the city's rich cultural and architectural
heritage. By advising on matters related to historic preservation, the
commission helps maintain the unique character and identity of San
Francisco. Its involvement in processes concerning historic or
cultural resources ensures that these valuable assets are protected
and celebrated for future generations to appreciate.  Ov

In [20]:
chat_engine.chat_history

[ChatMessage(role=<MessageRole.USER: 'user'>, content='What might most San Francisco residents find most interesting about the city charter? Be concise.', additional_kwargs={}),
 ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=None, additional_kwargs={'tool_calls': [ChatCompletionMessageToolCall(id='call_0xcsq3QJbL97brGIjHgxzLKp', function=Function(arguments='{"input":"Most interesting aspects of the San Francisco city charter for residents"}', name='query_engine_tool'), type='function')]}),
 ChatMessage(role=<MessageRole.TOOL: 'tool'>, content='Residents of San Francisco may find the provisions related to the Small Business Commission and the Historic Preservation Commission to be particularly interesting. The Small Business Commission oversees the San Francisco Office of Small Business and consists of members appointed by the Mayor and the Board of Supervisors, reflecting the diversity of neighborhood and small business interests in the city. On the other hand, the His
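
The history above is what lets a follow-up such as "Can you expand on that?" make sense: every turn is appended to a list and the whole list is sent with the next request. A toy sketch of that loop, assuming plain dicts for messages and a hypothetical `fake_model` in place of the LLM; it is not the chat engine's actual implementation:

```python
def chat_turn(history, user_message, answer_fn):
    """Append the user message, call the model with the full history, store the reply."""
    history.append({"role": "user", "content": user_message})
    reply = answer_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply

def fake_model(history):
    # Hypothetical stand-in for the LLM: reports how many earlier messages it can see.
    return f"I can see {len(history) - 1} earlier message(s)."

history = []
chat_turn(history, "What is interesting about the charter?", fake_model)
print(chat_turn(history, "Can you expand on that?", fake_model))
```

Because the history grows with every turn, long conversations eventually need trimming or summarizing to stay within the model's context window.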

# Vector DB

In [21]:
!pip install llama-index-vector-stores-weaviate
!pip install llama-index-vector-stores-chroma


In [22]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

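# PersistentClient stores its data on disk (default path "./chroma"), so the collection survives restarts.
chroma_client = chromadb.PersistentClient()
# get_or_create_collection avoids the error create_collection raises when this cell is re-run.
chroma_collection = chroma_client.get_or_create_collection("class3example")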
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [38]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

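# Filter to PDFs on the reader; load_data() takes show_progress, not a filename pattern.
documents = SimpleDirectoryReader("./", required_exts=[".pdf"]).load_data(show_progress=True)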
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Loading files: 100%|██████████| 1/1 [02:10<00:00, 130.84s/file]


In [37]:
query_engine = index.as_query_engine()

response = query_engine.query("What might most San Francisco residents find most interesting about the city charter? Be concise.")
wrap_print(str(response))

The most interesting aspect of the city charter for most San Francisco
residents might be the detailed provisions regarding the status and
rights of incumbent officers and employees during the transition
period when the charter becomes operative.


# Extra Credit: YouTube Video Loader by Video URL
### Check out more data loaders here: https://llamahub.ai/?tab=readers

In [49]:
!pip install llama-index-readers-youtube-transcript
!pip install youtube_transcript_api



In [40]:
from llama_index.readers.youtube_transcript import YoutubeTranscriptReader

def get_youtube_transcript(youtube_url):
  loader = YoutubeTranscriptReader()
  documents = loader.load_data(ytlinks=[youtube_url])
  transcript = documents[0].text
  return transcript

In [41]:
def generate_summary(text):
    # The `client` created in Setup already holds the API key, so no module-level key is needed.

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that summarizes text."},
            {"role": "user", "content": f"Please summarize the following transcript and be concise:\n\n{text}"}
        ],
        max_tokens=500
    )

    return response.choices[0].message.content
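
`generate_summary` sends the whole transcript in one request, which works as long as the transcript fits in the model's context window. For very long videos, one possible workaround is to summarize in pieces and then summarize the partial summaries. A minimal sketch under stated assumptions: the names `split_words` and `summarize_long` are hypothetical, the 1500-word chunk size is arbitrary, and splitting on word count is only a rough proxy for token count.

```python
def split_words(text, max_words=1500):
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long(text, summarize_fn, max_words=1500):
    """Summarize each piece with summarize_fn, then summarize the joined partial summaries."""
    chunks = split_words(text, max_words)
    if len(chunks) <= 1:
        return summarize_fn(text)
    partials = [summarize_fn(chunk) for chunk in chunks]
    return summarize_fn("\n".join(partials))

# Offline demo with a trivial stand-in summarizer that keeps the first word of its input.
first_word = lambda t: t.split()[0] if t.split() else ""
print(summarize_long("a b c d e f", first_word, max_words=2))
```

In the notebook, `summarize_fn` would be `generate_summary`. The sketch does a single combine pass, so it does not handle the case where the joined partial summaries are themselves too long.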

In [42]:
def summarize_youtube_video(youtube_url):
    transcript = get_youtube_transcript(youtube_url)
    summary = generate_summary(transcript)
    return summary

In [43]:
from IPython.display import Markdown, display

In [45]:
import openai

In [46]:
youtube_url = "https://www.youtube.com/watch?v=GeAaM4vLXZE"
summary = summarize_youtube_video(youtube_url)
display(Markdown(summary))
#print(summary)

In this segment of "Home Movies," host Allison Roman discusses making a tomato tart during peak tomato season. She expresses a preference for rearranging floral arrangements for aesthetic appeal but quickly shifts to cooking. The tart features a crust similar to a Cheez-It, made from flour, polenta, Parmesan, and butter, and highlights the importance of using water to bind it. After preparing and baking the crust, Roman slices various tomatoes for the filling, recommending their use regardless of quality. The tart is baked with the tomatoes seasoned with garlic, olive oil, and optional capers. After baking, she emphasizes serving it warm or at room temperature. She offers insights on cooking with leftover ingredients for a salad and encourages viewers to enjoy the tart with different side dishes. The episode celebrates the joys of cooking with fresh, seasonal produce.