<a href="https://colab.research.google.com/github/aicrashcoursewinter24/ai_crashcourselabsLukeA/blob/AGENT_RAG_01-19/NEW_RAG_first_steps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
! pip install numexpr openai langchain_openai langchain langchain-community duckduckgo-search loguru qdrant-client sentence-transformers tiktoken docx2txt unstructured tqdm

Collecting openai
  Using cached openai-1.9.0-py3-none-any.whl (223 kB)
Collecting langchain_openai
  Using cached langchain_openai-0.0.3-py3-none-any.whl (28 kB)
Collecting langchain
  Using cached langchain-0.1.1-py3-none-any.whl (802 kB)
Collecting langchain-community
  Downloading langchain_community-0.0.13-py3-none-any.whl (1.6 MB)
Collecting duckduckgo-search
  Downloading duckduckgo_search-4.2-py3-none-any.whl (25 kB)
Collecting loguru
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Collecting qdrant-client
  Downloading qdrant_client-1.7.1-py3-none-any.whl (205 kB)
Collecting sentence-transformers

In [3]:
from typing import List, Dict, Any

import xml.etree.ElementTree as ET
import json

from langchain_core.documents import Document
from langchain_community.document_loaders.base import BaseLoader
from langchain_core.embeddings import Embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Qdrant
from langchain.tools.retriever import create_retriever_tool


def parse_xml_to_dict(file_path: str) -> List[Dict[str, Any]]:
    # Parse the XML file
    tree = ET.parse(file_path)
    root = tree.getroot()

    # Function to parse each review element
    def parse_review(review_element: ET.Element) -> Dict[str, Any]:
        review_data = {}
        for child in review_element:
            if child.tag == 'reviewer':
                # Special handling for nested 'reviewer' tag
                reviewer_data = {grandchild.tag: grandchild.text for grandchild in child}
                review_data[child.tag] = reviewer_data
            else:
                review_data[child.tag] = child.text
        return review_data

    # Parse all reviews
    reviews = [parse_review(review) for review in root]
    return reviews


class XMLReviewLoader(BaseLoader):
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self) -> List[Document]:
        docs = parse_xml_to_dict(self.file_path)
        return [Document(page_content=json.dumps(doc)) for doc in docs]


class VerboseEmbeddings(Embeddings):
    def __init__(self, embeddings: Embeddings):
        self.embeddings = embeddings

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        print(f"Embedding {len(texts)} documents")
        return self.embeddings.embed_documents(texts)

    def embed_query(self, text: str) -> List[float]:
        print(f"Embedding query: {text}")
        return self.embeddings.embed_query(text)


def build_vectorstore_from_xml_directory(path: str, vectorstore_path: str, loading_glob="**/*.xml",
                                         collection_name="my_documents",
                                         embeddings: Embeddings = VerboseEmbeddings(HuggingFaceEmbeddings())):
    # setting up a document loader which loads all xml reviews in the directory
    dir_loader = DirectoryLoader(path=path, glob=loading_glob, loader_cls=XMLReviewLoader,
                                 show_progress=True, use_multithreading=True)
    docs = dir_loader.load()
    # splitting the xml documents into chunks of up to 3000 characters with an overlap of 100 characters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=3000, chunk_overlap=100, add_start_index=True
    )
    all_splits: List[Document] = text_splitter.split_documents(docs)
    print(f"Split {len(docs)} documents into {len(all_splits)} chunks")
    # setting up a vectorstore with the Qdrant backend
    vectorstore = Qdrant.from_documents(
        all_splits,
        embeddings,
        path=vectorstore_path,
        collection_name=collection_name
    )
    # "k" sets how many nearest neighbors the retriever returns; 6 is an arbitrary choice.
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
    return retriever


def retriever_tool(retriever):
    """Creates a tool that searches the vectorstore for documents similar to the input text."""
    tool = create_retriever_tool(
        retriever,
        "search_xml_documents",
        "Searches inside of our local store old Best Buy reviews.  " +
        "Useful for understanding how people talk about products from the early 2010's era",
    )
    return tool
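
To make the loader's output concrete, here is a minimal, self-contained sketch of the same parsing logic applied to an in-memory XML string. The tag names (`reviews`, `review`, `id`, `rating`, `comments`, `reviewer`, `name`) are hypothetical placeholders; the real Best Buy files may use a different schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: these tag names are assumptions, not the real schema.
SAMPLE_XML = """
<reviews>
  <review>
    <id>1</id>
    <rating>5</rating>
    <comments>Great laptop.</comments>
    <reviewer><name>alice</name></reviewer>
  </review>
</reviews>
"""


def parse_review(review_element):
    """Flatten one <review> element into a dict, nesting the <reviewer> children."""
    data = {}
    for child in review_element:
        if child.tag == "reviewer":
            # Nested element: collect its children into a sub-dictionary.
            data[child.tag] = {grandchild.tag: grandchild.text for grandchild in child}
        else:
            data[child.tag] = child.text
    return data


root = ET.fromstring(SAMPLE_XML)
reviews = [parse_review(review) for review in root]

# Each review becomes one JSON string, mirroring what XMLReviewLoader
# stores in Document.page_content.
page_contents = [json.dumps(review) for review in reviews]
print(page_contents[0])
```

Serializing each review as JSON keeps the field names next to their values, so the embedded text carries some of the record's structure along with the review body.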




In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
!ls drive/MyDrive/CSC480-RAG

reviews_0001_24122_to_98772.xml      reviews_0024_5130266_to_5247642.xml
reviews_0002_98780_to_345240.xml     reviews_0025_5247643_to_5363104.xml
reviews_0003_345248_to_501823.xml    reviews_0026_5363109_to_5806838.xml
reviews_0004_501835_to_691133.xml    reviews_0027_5806868_to_6022633.xml
reviews_0005_691144_to_831905.xml    reviews_0028_6022641_to_6172272.xml
reviews_0006_831928_to_1117787.xml   reviews_0029_6172275_to_6358111.xml
reviews_0007_1117796_to_1382936.xml  reviews_0030_6358125_to_6396149.xml
reviews_0008_1382981_to_1597862.xml  reviews_0031_6396154_to_6423017.xml
reviews_0009_1597878_to_1906168.xml  reviews_0032_6423025_to_6486864.xml
reviews_0010_1906234_to_2063675.xml  reviews_0033_6486865_to_6587831.xml
reviews_0011_2063679_to_2151936.xml  reviews_0034_6587839_to_6701463.xml
reviews_0012_2151940_to_2284992.xml  reviews_0035_6701490_to_6801194.xml
reviews_0013_2285000_to_2440628.xml  reviews_0036_6801200_to_6889357.xml
reviews_0014_2440655_to_2799035.xml  reviews_0037_6

In [10]:

review_dir = "/content/drive/MyDrive/CSC480-RAG/"


In [11]:
retriever = build_vectorstore_from_xml_directory(collection_name="test", path=review_dir, loading_glob="**/reviews_0001_24122_to_98772.xml", vectorstore_path='/content/reviews')


100%|██████████| 1/1 [00:01<00:00,  1.05s/it]


Split 5000 documents into 5024 chunks
Embedding 1 documents
Embedding 64 documents
Embedding 64 documents
[... "Embedding 64 documents" repeats for the subsequent batches; output truncated ...]
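
The log above suggests the chunks are embedded in batches. Below is a rough sketch of the arithmetic, assuming a batch size of 64 (inferred from the log lines, not confirmed against the vector store's source). `batched` is a hypothetical helper, not a LangChain function. The initial single-document call is presumably a probe to determine the vector size, but that is also an assumption.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 5024 placeholder chunks, matching the "Split 5000 documents into 5024 chunks" line.
chunks = [f"chunk {i}" for i in range(5024)]
batch_sizes = [len(batch) for batch in batched(chunks, 64)]

# Number of batches and the size of the final (partial) batch.
print(len(batch_sizes), batch_sizes[-1])  # prints: 79 32
```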

In [12]:
review_tool = retriever_tool(retriever)
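
The retriever behind this tool was configured with `search_type="similarity"` and `k=6`. As a minimal sketch of what a similarity search does conceptually, the example below ranks toy 2-dimensional vectors by cosine similarity and keeps the top `k`. The choice of cosine as the metric and the helper names `cosine` and `top_k` are assumptions for illustration; the actual search runs inside Qdrant on the embedding model's vectors.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones have hundreds of dimensions.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_vector = [1.0, 0.1]

print([index for index, _ in top_k(query_vector, doc_vectors, k=2)])  # prints: [0, 2]
```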

In [13]:
import os
from getpass import getpass
from langchain.agents import AgentExecutor, AgentType, load_tools, initialize_agent
from langchain_community.chat_models import ChatAnyscale
from langchain_openai import OpenAIEmbeddings

In [14]:
from google.colab import userdata

llm = ChatAnyscale(anyscale_api_base="https://api.endpoints.anyscale.com/v1",
                   anyscale_api_key=userdata.get("ANYSCALE_API_KEY"),
                   model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
                   temperature=0.7,
                   verbose=True)

In [15]:
tools = load_tools(["llm-math", "ddg-search"], llm=llm) + [review_tool]

In [16]:
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
print(agent.agent.llm_chain.prompt.template)

Answer the following questions as best you can. You have access to the following tools:

Calculator: Useful for when you need to answer questions about math.
duckduckgo_search: A wrapper around DuckDuckGo Search. Useful for when you need to answer questions about current events. Input should be a search query.
search_xml_documents: Searches inside of our local store old Best Buy reviews.  Useful for understanding how people talk about products from the early 2010's era

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Calculator, duckduckgo_search, search_xml_documents]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}
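
The prompt above asks the model to emit `Action:` and `Action Input:` lines, which the agent must then read back out of free text before calling a tool. The following is a simplified sketch of that parsing step, not LangChain's actual output parser; `parse_step` and `ACTION_RE` are hypothetical names.

```python
import re
from typing import Tuple

# Matches an "Action:" line immediately followed by an "Action Input:" line.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)")


def parse_step(llm_text: str) -> Tuple[str, str]:
    """Return ("final", answer) or (tool_name, tool_input) from model output."""
    if "Final Answer:" in llm_text:
        return ("final", llm_text.split("Final Answer:", 1)[1].strip())
    match = ACTION_RE.search(llm_text)
    if match is None:
        raise ValueError("could not parse an action from the model output")
    return (match.group(1).strip(), match.group(2).strip())


step = parse_step(
    "I should look this up.\n"
    "Action: duckduckgo_search\n"
    "Action Input: Magic: The Gathering Cedh format\n"
)
print(step)  # prints: ('duckduckgo_search', 'Magic: The Gathering Cedh format')
```

In a full agent loop, the parsed tool name selects one of the listed tools, the tool's result is appended to the scratchpad as an `Observation:` line, and the model is called again until it produces a `Final Answer:`.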




In [18]:
agent.invoke("What is Magic: The Gathering Cedh format?")



> Entering new AgentExecutor chain...
I don't have the specific information about Magic: The Gathering's Commander (EDH) format variants. I should look up the current formats in the Magic: The Gathering community.
Action: duckduckgo_search
Action Input: Magic: The Gathering Cedh format

Observation
Observation: Tabletop Magic: The Gathering - Everything You Need To Know About cEDH By Johnny Garcia Published Jul 6, 2023 Do you like playing Commander? We have a recommendation for you. Quick Links What Is cEDH? What Cards Are Banned In cEDH? What Makes A cEDH Commander? What Makes A Card cEDH Viable? How To Win cEDH Games Where To Play cEDH Commander itself is a very stable format that rares sees bans and even when they do come they rarely affect the most powerful end of games. cEDH gives you a chance to try out a powerful and dynamic form of Magic that tries to do the impossible: break a format that was purposefully meant not to be broken. What Is

{'input': 'What is Magic: The Gathering Cedh format?',
 'output': 'CEDH, or Competitive EDH, is a variant of the Commander format in Magic: The Gathering. It focuses on playing as quickly and efficiently as possible, often involving combos and less direct interaction. The philosophy of CEDH is to push the boundaries of what\'s possible in the Commander format, which is already designed to be more stable and less prone to being "broken".'}

In [None]:
# The agent generates intermediate "Thought" steps that plan how it will answer the question, as described in lecture.
# The process closely resembles how a person looks something up online: notice what is missing, run a search, then summarize the results.