<a href="https://colab.research.google.com/github/XVI-Adam/MU_RAG_Workshop/blob/main/ManhattanUniversity_RAG_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What is RAG anyway?


Retrieval-Augmented Generation (RAG) is a technique used in GenAI applications to improve the quality and accuracy of LLM-generated text by combining two processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system uses it to help generate a response. This is where a model such as GPT creates new text (answers, explanations, etc.) grounded in the retrieved information.
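
The two steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the `DOCUMENTS` list, the word-overlap scoring, and the helper names `tokenize`, `retrieve`, and `build_prompt` are assumptions made for the example, not part of the workshop code, which uses embeddings and Pinecone for retrieval instead.

```python
import re

# Toy knowledge base (hypothetical); in the workshop this role is played by a Pinecone index.
DOCUMENTS = [
    "A linked list stores elements in nodes that point to the next node.",
    "An array stores elements in contiguous memory locations.",
    "Pinecone is a managed vector database.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=2):
    """Retrieval step: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, contexts):
    """Input for the generation step: retrieved text placed in front of the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}"

question = "What is a linked list?"
prompt = build_prompt(question, retrieve(question, DOCUMENTS, top_k=1))
print(prompt)  # this prompt would then be sent to an LLM
```

Word overlap is a crude relevance signal; the rest of the notebook replaces it with embedding similarity, which can match text that shares meaning but not vocabulary.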

**Other Resources:**
- [Get your OpenAI API Key](https://platform.openai.com/settings/profile?tab=api-keys)
- [Get your Pinecone API Key](https://www.pinecone.io/)
- [Get your OpenRouter API Key](https://openrouter.ai/settings/keys)
- [JavaScript Code for RAG](https://js.langchain.com/v0.2/docs/tutorials/rag)
- [RAG with an in-memory database in Next.js](https://sdk.vercel.ai/examples/node/generating-text/rag)


#### Install relevant libraries

In [None]:
! pip install langchain langchain-community openai tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference youtube-transcript-api pytube sentence-transformers

Collecting langchain
  Downloading langchain-0.3.4-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.3-py3-none-any.whl.metadata (2.8 kB)
Collecting openai
  Downloading openai-1.52.1-py3-none-any.whl.metadata (24 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting langchain_pinecone
  Downloading langchain_pinecone-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting unstructured
  Downloading unstructured-0.16.1-py3-none-any.whl.metadata (24 kB)
Collecting pdfminer==20191125
  Downloading pdfminer-20191125.tar.gz (4.2 MB)
Collecting pdfminer.six==20221105
  Downloading pdfmi

In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from google.colab import userdata
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os

pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key
os.environ['PINECONE_ENVIRONMENT'] = "us-east-1"

openai_api_key = userdata.get("OPENAI_API_KEY")
os.environ['OPENAI_API_KEY'] = openai_api_key



# Initialize the OpenAI client

In [None]:
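embed_model = "text-embedding-3-small"
# Use the same model for stored documents and for queries so their vectors are comparable
embeddings = OpenAIEmbeddings(model=embed_model)
openai_client = OpenAI()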



# Use HuggingFace & OpenRouter if you don't have an OpenAI account with credits
## API Keys:
- OpenAI: https://platform.openai.com/settings/profile?tab=api-keys
- OpenRouter: https://openrouter.ai/
- Pinecone: https://www.pinecone.io/



In [None]:
# HuggingFace Embeddings
# Use this instead of OpenAI embeddings if you don't have an OpenAI account with credits

text = "This is a test document."

hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/stsb-bert-large")
query_result = hf_embeddings.embed_query(text)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
query_result

[-0.21815195679664612,
 0.49602392315864563,
 0.72919100522995,
 0.12342813611030579,
 -1.2941139936447144,
 0.387632817029953,
 0.05223773419857025,
 0.16726770997047424,
 -1.1128382682800293,
 0.3223240375518799,
 -0.42982977628707886,
 0.5426003932952881,
 0.00623224675655365,
 0.06460632383823395,
 -0.012242259457707405,
 0.4685819149017334,
 -0.23193258047103882,
 0.3035549223423004,
 -1.9791057109832764,
 -0.34733954071998596,
 0.1305180937051773,
 0.44956517219543457,
 0.14612305164337158,
 -0.0918344259262085,
 -0.45842793583869934,
 -0.023911433294415474,
 -0.6461336016654968,
 0.5481970906257629,
 -0.11582361161708832,
 1.1632585525512695,
 -0.07838872820138931,
 0.2397138476371765,
 -1.1591699123382568,
 0.1624138206243515,
 -0.42054155468940735,
 -0.8614808917045593,
 0.6483257412910461,
 0.06850463896989822,
 0.6680849194526672,
 -0.5252474546432495,
 0.8910707831382751,
 0.9964404106140137,
 -0.05494459718465805,
 -0.09811010211706161,
 -0.3498879075050354,
 -0.8370872735

In [None]:
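# Use this instead of OpenAI if you don't have an OpenAI account with credits
# Assumption: the OpenRouter key is stored as a Colab secret named "OPENROUTER_API_KEY".
# Passing the OpenAI key here makes OpenRouter reject the request with a 401 error.

openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=userdata.get("OPENROUTER_API_KEY")
)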

## Initialize our text splitter
The text splitter breaks the text into chunks. These chunks are what gets retrieved during the RAG process.
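
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses: the library first tries to split on the given separators, while this hypothetical `chunk_text` helper simply slides a character window over the text.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut at
    a window boundary still appears in full in one of the neighbouring chunks
    when it is short enough.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap at work. Larger overlaps reduce the chance of losing context at a boundary at the cost of storing more duplicated text.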

In [None]:
# cl100k_base is the tiktoken encoding used by text-embedding-3-small
tokenizer = tiktoken.get_encoding('cl100k_base')

# Length function: measure chunk size in tokens rather than characters
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

# Understanding Embeddings
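
An embedding is a list of numbers that represents a piece of text, and two texts are compared by the angle between their vectors. As a minimal sketch of what cosine similarity computes, the hypothetical helper `cosine_sim` below works on small hand-written vectors. It is named differently from scikit-learn's `cosine_similarity` on purpose, so that it does not shadow the import used in the next cell.

```python
import math

def cosine_sim(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors; real embeddings have hundreds or thousands of dimensions.
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_sim([1.0, 0.0], [-1.0, 0.0])        # -1
print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposite directions. The length of the vectors does not affect the result, only their direction.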


In [None]:
def get_embedding(text, model="text-embedding-3-small"):
    # Call the OpenAI API to get the embedding for the text
    response = openai_client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

# Alternative: get embeddings using HuggingFace instead of OpenAI
# def get_embedding(text):
#     return hf_embeddings.embed_query(text)

def cosine_similarity_between_words(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_embedding(sentence1))
    embedding2 = np.array(get_embedding(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]

# Example usage
sentence1 = "I hate studying at school"
sentence2 = "I like running to the office"


similarity = cosine_similarity_between_words(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")


Embedding for Sentence 1: [[-0.02592939 -0.02917057  0.01211917 ...  0.04695005 -0.0213025
   0.02994563]]

Embedding for Sentence 2: [[ 0.00702975  0.01988129  0.03200943 ... -0.00181888 -0.02809132
   0.00573872]]


Cosine similarity between 'I hate studying at school' and 'I like running to the office': 0.3447


# Load in a YouTube video and get its transcript

In [None]:
# Load in a YouTube video's transcript
loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=N6dOwBde7-M&ab_channel=BroCode", add_video_info=True)
data = loader.load()

print(data)

[Document(metadata={'source': 'N6dOwBde7-M', 'title': 'Learn Linked Lists in 13 minutes 🔗', 'description': 'Unknown', 'view_count': 322878, 'thumbnail_url': 'https://i.ytimg.com/vi/N6dOwBde7-M/hq720.jpg', 'publish_date': '2021-04-19 00:00:00', 'length': 804, 'author': 'Bro Code'}, page_content="hey what's going on everybody it's you bro hope you're doing well in this video we're going to discuss linked lists and computer science so sit back relax and enjoy the show now before we dive straight into linked lists we're going to take a closer examination of arrays and array lists we will see what disadvantages that these data structures have where linked lists excel at so we'll compare and contrast the differences between the two with what we understand with arrays and array lists these data structures store elements in contiguous memory locations in this demonstration i'm storing letters of the alphabet suppose that the first element of my array has a memory address of one two three fake 

In [None]:
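# This redefines text_splitter with the default length function, so the sizes below are in characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Adjust chunk size as needed
    chunk_overlap=200  # Adjust overlap as needed
)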
texts = text_splitter.split_documents(data)

In [None]:
texts

[Document(metadata={'source': 'N6dOwBde7-M', 'title': 'Learn Linked Lists in 13 minutes 🔗', 'description': 'Unknown', 'view_count': 322878, 'thumbnail_url': 'https://i.ytimg.com/vi/N6dOwBde7-M/hq720.jpg', 'publish_date': '2021-04-19 00:00:00', 'length': 804, 'author': 'Bro Code'}, page_content="hey what's going on everybody it's you bro hope you're doing well in this video we're going to discuss linked lists and computer science so sit back relax and enjoy the show now before we dive straight into linked lists we're going to take a closer examination of arrays and array lists we will see what disadvantages that these data structures have where linked lists excel at so we'll compare and contrast the differences between the two with what we understand with arrays and array lists these data structures store elements in contiguous memory locations in this demonstration i'm storing letters of the alphabet suppose that the first element of my array has a memory address of one two three fake 

# Initialize Pinecone

In [None]:
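index_name = "rag"

namespace = "youtube-videos"

# The embedding model must output vectors with the same dimension as the index.
# The "rag" index used here has dimension 1536, which matches OpenAI's text-embedding-3-small;
# stsb-bert-large (hf_embeddings) yields 1024-dimensional vectors and needs a 1024-dimensional index.
vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)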

# Insert data into Pinecone

Documentation: https://docs.pinecone.io/integrations/langchain#key-concepts
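
A vector index is created with a fixed dimension, and every vector that is stored or used as a query must have that dimension. The sketch below illustrates the idea with a hypothetical in-memory class, `TinyVectorIndex`. It is not the Pinecone API; the class, its methods, and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import math

class TinyVectorIndex:
    """Hypothetical in-memory stand-in for a vector index with a fixed dimension."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.records = []  # list of (vector, metadata) pairs

    def _check_dimension(self, vector):
        # Reject vectors whose length differs from the index dimension.
        if len(vector) != self.dimension:
            raise ValueError(
                f"Vector dimension {len(vector)} does not match "
                f"the dimension of the index {self.dimension}"
            )

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def upsert(self, vector, metadata):
        self._check_dimension(vector)
        self.records.append((list(vector), dict(metadata)))

    def query(self, vector, top_k):
        # Score every stored vector against the query and keep the best top_k.
        self._check_dimension(vector)
        scored = [
            {"score": self._cosine(vector, stored), "metadata": metadata}
            for stored, metadata in self.records
        ]
        scored.sort(key=lambda match: match["score"], reverse=True)
        return scored[:top_k]

index = TinyVectorIndex(dimension=3)
index.upsert([1.0, 0.0, 0.0], {"text": "linked lists"})
index.upsert([0.0, 1.0, 0.0], {"text": "arrays"})

matches = index.query([0.9, 0.1, 0.0], top_k=1)
print(matches[0]["metadata"]["text"])  # linked lists

try:
    index.upsert([1.0, 0.0], {"text": "wrong size"})
except ValueError as error:
    print(error)
```

In practice this means the embedding model used to insert documents and the one used to embed queries should be the same model, and the index should be created with that model's output dimension.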

In [None]:
for document in texts:
    print("\n\n\n\n----")

    print(document.metadata, document.page_content)

    print('\n\n\n\n----')





----
{'source': 'N6dOwBde7-M', 'title': 'Learn Linked Lists in 13 minutes 🔗', 'description': 'Unknown', 'view_count': 322878, 'thumbnail_url': 'https://i.ytimg.com/vi/N6dOwBde7-M/hq720.jpg', 'publish_date': '2021-04-19 00:00:00', 'length': 804, 'author': 'Bro Code'} hey what's going on everybody it's you bro hope you're doing well in this video we're going to discuss linked lists and computer science so sit back relax and enjoy the show now before we dive straight into linked lists we're going to take a closer examination of arrays and array lists we will see what disadvantages that these data structures have where linked lists excel at so we'll compare and contrast the differences between the two with what we understand with arrays and array lists these data structures store elements in contiguous memory locations in this demonstration i'm storing letters of the alphabet suppose that the first element of my array has a memory address of one two three fake street obviously these ar

In [None]:
# Embed with the same OpenAI model that is used for the query later (1536 dimensions, matching this index).
# hf_embeddings (stsb-bert-large) yields 1024-dimensional vectors, which this index rejects with:
#   "Vector dimension 1024 does not match the dimension of the index 1536"
# To use hf_embeddings instead, the index must be created with dimension 1024.
chunk_texts = [
    f"Source: {t.metadata['source']}, Title: {t.metadata['title']} \n\nContent: {t.page_content}"
    for t in texts
]
vectorstore_from_texts = PineconeVectorStore.from_texts(
    chunk_texts,
    OpenAIEmbeddings(model="text-embedding-3-small"),
    index_name=index_name,
    namespace="youtube-videos"
)


# Perform RAG

In [None]:
from pinecone import Pinecone, ServerlessSpec

In [None]:
# Initialize Pinecone
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))


# Connect to your Pinecone index
pinecone_index = pc.Index("rag")

In [None]:
query = "How do I make a linked list?"

In [None]:
# Using OpenAI
raw_query_embedding = openai_client.embeddings.create(
    input=[query],
    model="text-embedding-3-small"
)

query_embedding = raw_query_embedding.data[0].embedding

# Using HuggingFace (only if the documents were also embedded with hf_embeddings
# and the index dimension matches that model)
# query_embedding = hf_embeddings.embed_query(query)


In [None]:
query_embedding

[-0.008721982128918171,
 -0.024088166654109955,
 0.029769781976938248,
 -0.015929650515317917,
 0.015002279542386532,
 -0.011909086257219315,
 -0.022127775475382805,
 0.009432184509932995,
 -0.03099062480032444,
 0.031154967844486237,
 0.01807786338031292,
 0.04322252795100212,
 0.013217970728874207,
 -0.03676614910364151,
 0.005581833887845278,
 -0.05249623954296112,
 0.020918671041727066,
 -2.787982339214068e-05,
 0.039137400686740875,
 0.09142234176397324,
 0.018336119130253792,
 0.0367426723241806,
 0.04228341951966286,
 0.024769021198153496,
 -0.015483573079109192,
 -0.010482813231647015,
 0.007794611621648073,
 0.03153061121702194,
 -0.03333839774131775,
 -0.0060983444564044476,
 0.019134363159537315,
 -0.03740004822611809,
 -0.0010154125047847629,
 0.05038324370980263,
 0.0066031161695718765,
 -0.0040939319878816605,
 0.05334143713116646,
 -0.02026129513978958,
 0.013276665471494198,
 0.03967738896608353,
 -0.04247124120593071,
 -0.027257662266492844,
 0.044560760259628296,
 -0.

In [None]:
top_matches = pinecone_index.query(vector=query_embedding, top_k=10, include_metadata=True, namespace="youtube-videos")

In [None]:
top_matches

{'matches': [], 'namespace': 'youtube-videos', 'usage': {'read_units': 1}}

In [None]:
# Get the list of retrieved texts
contexts = [item['metadata']['text'] for item in top_matches['matches']]

In [None]:
contexts

[]

In [None]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[:10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

In [None]:
print(augmented_query)

<CONTEXT>

-------
</CONTEXT>



MY QUESTION:
How do I make a linked list?
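
The recorded output above shows an empty `<CONTEXT>` block, which happens when retrieval returns no matches. One option, sketched here under the assumption that an empty retrieval should stop the pipeline before the LLM call, is to build the prompt in a small helper that refuses to proceed without context. The name `build_augmented_query` and its `max_contexts` parameter are hypothetical; the prompt layout copies the one used in this notebook.

```python
def build_augmented_query(query, contexts, max_contexts=10):
    """Wrap retrieved chunks in <CONTEXT> tags ahead of the question.

    Raises ValueError when nothing was retrieved, so an empty context is
    caught before any LLM call is made.
    """
    if not contexts:
        raise ValueError(
            "No context retrieved; check that the index and namespace contain data."
        )
    context_block = "\n\n-------\n\n".join(contexts[:max_contexts])
    return (
        "<CONTEXT>\n" + context_block + "\n-------\n</CONTEXT>"
        "\n\n\n\nMY QUESTION:\n" + query
    )

example = build_augmented_query(
    "How do I make a linked list?",
    ["A linked list is made of nodes.", "Each node points to the next node."],
)
print(example)
```

Raising an error is only one possible policy. Returning a fixed "no relevant context found" answer without calling the model would work as well, depending on how the application should behave.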


In [None]:
# Modify the prompt below as needed to improve the response quality

primer = """You are a personal assistant. Answer any questions I have about the YouTube video provided. You always
answer questions based only on the context that you have been provided.
"""
res = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

openai_answer = res.choices[0].message.content


In [None]:
print(openai_answer)

I'm sorry, I don't have any information or context from the video provided to answer your question about creating a linked list. Could you provide me with more details or context from the video? Alternatively, I can offer a general explanation on how to make a linked list if that would be helpful.


# Using OpenRouter

In [None]:
# Check out different models here: https://openrouter.ai/docs/models

# Modify the prompt below as needed to improve the response quality

primer = """You are a personal assistant. Answer questions based on the YouTube video provided. Always include:
1. Your answer.
2. A direct quote from the video that supports your answer. If a direct quote is unavailable, mention it explicitly.
"""

res = openrouter_client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct:free",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

answer = res.choices[0].message.content

AuthenticationError: Error code: 401 - {'error': {'message': 'Missing Authentication header or invalid API key', 'code': 401}}

In [None]:
print(answer)

NameError: name 'answer' is not defined
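
A 401 response means the provider did not receive a key it accepts. A missing secret can be caught earlier, before the client is created, with a small check like the sketch below. The helper `require_secret` is hypothetical, the secret name `OPENROUTER_API_KEY` is an assumption, and the check only detects a missing or blank value; it cannot tell whether a key is valid for a given provider.

```python
import os

def require_secret(name, env=None):
    """Return the value stored under `name`, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value is None or not value.strip():
        raise RuntimeError(
            f"Secret {name!r} is not set; add it before creating the client."
        )
    return value

# Example with an explicit mapping so the sketch runs without real credentials.
fake_env = {"OPENROUTER_API_KEY": "sk-or-example"}
print(require_secret("OPENROUTER_API_KEY", env=fake_env))
```

In Colab, the same idea applies to values read with `userdata.get(...)`: checking the result before passing it to a client gives a clearer message than the API error.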

# Putting it all together

In [None]:
def perform_rag(query):
    raw_query_embedding = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )

    query_embedding = raw_query_embedding.data[0].embedding

    top_matches = pinecone_index.query(vector=query_embedding, top_k=10, include_metadata=True, namespace="youtube-videos")

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[:10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = """You are an expert personal assistant. Answer any questions I have about the YouTube video provided. You always answer questions based only on the context that you have been provided.
    """

    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return res.choices[0].message.content


In [None]:
perform_rag("How do I make a linked list?")

"I'm sorry, I don't have information from a video or context to guide you on making a linked list. However, I can give you a brief explanation on how to create a linked list in a general programming context. Here’s a simple way to explain it:\n\nA linked list typically consists of nodes where each node contains two parts: data and a reference (or link) to the next node in the sequence.\n\nHere’s a basic example in Python:\n\n1. **Define a Node class:**\n\n```python\nclass Node:\n    def __init__(self, data):\n        self.data = data\n        self.next = None\n```\n\n2. **Define a LinkedList class:**\n\n```python\nclass LinkedList:\n    def __init__(self):\n        self.head = None\n\n    def append(self, data):\n        new_node = Node(data)\n        if not self.head:\n            self.head = new_node\n            return\n        last_node = self.head\n        while last_node.next:\n            last_node = last_node.next\n        last_node.next = new_node\n\n    def print_list(self):\