### AFJ Limited QA bot
In this section we index the entire AFJ Limited website so that we can build a QA chatbot that answers questions about the company.

In [1]:
# needed only once to recursively download the AFJ limited website

domain = "afjltd.co.uk"
docs_url = "https://afjltd.co.uk"
!wget -e robots=off --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {docs_url}

Both --no-clobber and --convert-links were specified, only --convert-links will be used.
--2024-02-13 02:05:32--  https://afjltd.co.uk/
Resolving afjltd.co.uk (afjltd.co.uk)... 35.214.126.108
Connecting to afjltd.co.uk (afjltd.co.uk)|35.214.126.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘afjltd.co.uk/index.html’

afjltd.co.uk/index.     [ <=>                ] 171.39K  --.-KB/s    in 0.06s   

2024-02-13 02:05:32 (2.85 MB/s) - ‘afjltd.co.uk/index.html’ saved [175501]

--2024-02-13 02:05:32--  https://afjltd.co.uk/feed/
Reusing existing connection to afjltd.co.uk:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/rss+xml]
Saving to: ‘afjltd.co.uk/feed/index.html’

afjltd.co.uk/feed/i     [ <=>                ]  60.00K  --.-KB/s    in 0.02s   

2024-02-13 02:05:32 (2.80 MB/s) - ‘afjltd.co.uk/feed/index.html’ saved [61445]

--2024-02-13 02:05:32--  https://afjltd.co.uk/comments/feed/
Reu

In [1]:
from llama_hub.file.unstructured.base import UnstructuredReader
from llama_index.llms.openai import OpenAI
from llama_index.service_context import ServiceContext
from pathlib import Path

In [2]:
reader = UnstructuredReader()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kosisochukwuasuzu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kosisochukwuasuzu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
all_files_gen = Path("./afjltd.co.uk/").rglob("*")
all_files = [f.resolve() for f in all_files_gen]

In [4]:
len(all_files)

68

In [5]:
suffixes = {file.suffix for file in all_files}

In [6]:
suffixes

{'', '.html'}

In [7]:
all_html_files = [file for file in all_files if file.suffix.lower() == ".html"]

In [8]:
len(all_html_files)

32

In [9]:
from llama_index.schema import Document

In [30]:
def clean_name(name: str):
    items = name.split("-")
    items = [item.capitalize() for item in items]
    return " ".join(items)

In [37]:
from textwrap import dedent

In [43]:
full_text = "**** AFJ Limited website information *****\n\n"
for idx, f in enumerate(all_html_files):
    # each page is saved as <page-name>/index.html, so the parent directory names the page
    page_name = clean_name(f.parent.name)
    print(f"idx {idx + 1}/{len(all_html_files)}")
    loaded_doc = reader.load_data(file=f, split_documents=True)

    full_text += f"\n\n****AFJ Limited {page_name} information **** \n\n"
    full_text += "\n\n".join([d.get_content() for d in loaded_doc])

# combine the text of every page into a single Document
doc = Document(text=full_text)
    

idx 1/32
idx 2/32
idx 3/32
idx 4/32
idx 5/32
idx 6/32
idx 7/32
idx 8/32
idx 9/32
idx 10/32
idx 11/32
idx 12/32
idx 13/32
idx 14/32
idx 15/32
idx 16/32
idx 17/32
idx 18/32
idx 19/32
idx 20/32
idx 21/32
idx 22/32
idx 23/32
idx 24/32
idx 25/32
idx 26/32
idx 27/32
idx 28/32
idx 29/32
idx 30/32
idx 31/32
idx 32/32


In [39]:
print(doc)

Doc ID: 7b1488d5-0ec9-42f7-9f8c-dace5de3f73d
Text: **** AFJ Limited website information *****    ****AFJ Limited
Feed information ****     ****AFJ Limited Blogs information ****
Established in 2006 with over 15 years of Experience  AFJ Business
Center, 2-18 Forster Street Nechells Birmingham  0121 689 1000
info@afjltd.co.uk  About UsAbout UsOur VisionOur Mission  servicesHome
to School Transp...


In [40]:
print(doc.text)

**** AFJ Limited website information *****



****AFJ Limited Feed information **** 



****AFJ Limited Blogs information **** 

Established in 2006 with over 15 years of Experience

AFJ Business Center, 2-18 Forster Street Nechells Birmingham

0121 689 1000

info@afjltd.co.uk

About UsAbout UsOur VisionOur Mission

servicesHome to School Transport ServicesPatient Transport ServiceExecutive Minibus HireAFJ Fleet Maintenance ServicesPrivate Hire Minibus With DriverAFJ Training AcademyRepatriation ServicesAmbulance Event Cover

ResourcesBlogsCommunityNews

Contact

Careers

Bookings

About UsAbout UsOur VisionOur Mission

servicesHome to School Transport ServicesPatient Transport ServiceExecutive Minibus HireAFJ Fleet Maintenance ServicesPrivate Hire Minibus With DriverAFJ Training AcademyRepatriation ServicesAmbulance Event Cover

ResourcesBlogsCommunityNews

Contact

Careers

Blogs - AFJ

Home

Blogs

21 Nov, 2023

Posted inOur Blog

CSR

BUSINESSES CHANGING LIVES THROUGH CSR Corporate

In [45]:
llm = OpenAI(model="mistralai/Mixtral-8x7B-Instruct-v0.1", temperature=0.0)
service_context = ServiceContext.from_defaults(llm=llm)

In [46]:
from llama_index.indices.vector_store import VectorStoreIndex
from llama_index.indices.list.base import SummaryIndex

### Use documents to build a standard QA agent
We use the extracted document to build a vector index for our QA system.


In [56]:
index = VectorStoreIndex.from_documents(documents=[doc], service_context=service_context)

In [57]:
website_qa = index.as_query_engine(service_context=service_context)
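
To make the retrieval step less opaque, here is a minimal sketch of what a vector index does conceptually: embed each chunk, embed the query, and return the closest chunks. The bag-of-words `embed` function, the `top_k` helper and the sample chunks are stand-ins invented for illustration. `VectorStoreIndex` relies on a learned embedding model instead, so this is not its implementation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# illustrative chunks, not actual website text
chunks = [
    "patient transport service for medical appointments",
    "home to school transport for students",
    "fleet maintenance for commercial vehicles",
]
print(top_k("school transport", chunks, k=1))
```

The retrieved chunks are what the query engine then passes to the LLM as context for the answer.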

In [68]:
response = website_qa.query("what job roles are available at AFJ limited?")

In [69]:
from pprint import pprint

In [70]:
pprint(response.response)

('Based on the information provided, AFJ Limited offers a variety of job roles '
 'related to their services. These roles might include:\n'
 '\n'
 '1. Fleet Maintenance Technicians: These professionals would be responsible '
 "for maintaining the company's light commercial vehicle (LCV) fleet.\n"
 '\n'
 '2. Fleet Management Specialists: They would manage various aspects of the '
 "company's fleet, such as fuel management, vehicle tracking, and driver "
 'safety training.\n'
 '\n'
 '3. Home to School Transport Drivers: These drivers would be responsible for '
 'transporting students to and from school.\n'
 '\n'
 '4. Patient Transport Service Drivers: They would transport patients to and '
 'from medical appointments.\n'
 '\n'
 '5. Executive Minibus Drivers: These drivers would operate minibusses for '
 'executive hire.\n'
 '\n'
 '6. AFJ Training Academy Instructors: They would train drivers and other '
 'staff members in various skills.\n'
 '\n'
 '7. Repatriation Services Staff: These p

In [61]:
response.get_formatted_sources()

'> Source (Doc id: cc614014-f856-474d-bdd3-9583e044389f): Safeguarding of Children and Vulnerable Adults: Our safeguarding courses aim to ensure that indiv...\n\n> Source (Doc id: ea2a029e-2b86-41fb-8a45-0960877a58c8): Facebook\n\nYouTube\n\nTwitter\n\nInstagram\n\nMenus\n\nHome\n\nAbout Us\n\nOur Vision\n\nOur Mission\n\nContact  U...'

#### Now let's save the index
We persist the index's storage context so that the indexing step does not have to be repeated when the QA engine is used on another platform.

In [71]:
index.storage_context.persist(persist_dir="afjlimitedweb")

In [72]:
from llama_index.storage import StorageContext
from llama_index.indices.loading import load_index_from_storage

In [73]:
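storage_context = StorageContext.from_defaults(persist_dir="afjlimitedweb")
# pass the same service context used at build time so the reloaded index does not fall back to the defaults
index = load_index_from_storage(storage_context, service_context=service_context)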

In [74]:
query_engine = index.as_query_engine(service_context=service_context)
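
One way to make the "index once, reuse later" pattern explicit is to build only when nothing has been persisted yet. The sketch below shows that guard in isolation. `load_or_build`, the `build` callable and the `index.json` file name are hypothetical, and a JSON file stands in for the index. With LlamaIndex, the two branches would presumably correspond to `persist` and `load_index_from_storage` as used above.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_build(persist_dir: str, build: Callable[[], dict]) -> dict:
    """Load a persisted object if it exists, otherwise build it and persist it."""
    store = Path(persist_dir) / "index.json"  # hypothetical file name
    if store.exists():
        # cheap path: reuse what an earlier run saved
        return json.loads(store.read_text())
    obj = build()  # expensive path: runs only on the first call
    store.parent.mkdir(parents=True, exist_ok=True)
    store.write_text(json.dumps(obj))
    return obj
```

Calling `load_or_build` twice with the same directory invokes `build` only the first time.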


In [75]:
response = query_engine.query("what is the mission of AFJ limited")

In [76]:
pprint(response.response)

("AFJ Limited's mission, as deduced from the context, is to provide reliable, "
 'safe, and sustainable transport services while contributing positively to '
 'society and the environment. They prioritize ethical practices, community '
 'betterment, and environmental responsibility in their operations.')


In [13]:
import nest_asyncio

nest_asyncio.apply()

In [14]:
from llama_index.agent.openai.base import OpenAIAgent
from llama_index.indices.loading import load_index_from_storage
from llama_index.storage import StorageContext
from llama_index.tools.query_engine import QueryEngineTool
from llama_index.tools.types import ToolMetadata
from llama_index.node_parser import SentenceSplitter
import os
from tqdm import tqdm
import pickle


In [15]:
from textwrap import dedent

In [16]:
async def build_agent_per_doc(nodes, file_base):
    print(file_base)
    vi_out_path = f"./data/website/{file_base}" # output path for vector index
    summary_out_path = f"./data/website/{file_base}_summary.pkl" # output file for summary index
    
    if not os.path.exists(vi_out_path):
        Path("./data/website/").mkdir(parents=True, exist_ok=True)
        vector_index = VectorStoreIndex(nodes, service_context=service_context)
        vector_index.storage_context.persist(persist_dir=vi_out_path)
        
    else:
        vector_index = load_index_from_storage(StorageContext.from_defaults(persist_dir=vi_out_path),
                                               service_context=service_context)
    
    # build the summary index
    summary_index = SummaryIndex(nodes, service_context=service_context)
    
    vector_query_engine = vector_index.as_query_engine()
    summary_query_engine = summary_index.as_query_engine(response_mode="tree_summarize")
    
    if not os.path.exists(summary_out_path):
        Path(summary_out_path).parent.mkdir(parents=True, exist_ok=True)
        summary = str(
            await summary_query_engine.aquery(
                "Extract a concise 1-2 line summary of this document"
            )
        )
        # cache the summary so it is not regenerated on later runs
        with open(summary_out_path, "wb") as fh:
            pickle.dump(summary, fh)
    else:
        with open(summary_out_path, "rb") as fh:
            summary = pickle.load(fh)

    query_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name=f"vector_tool_{file_base}",
                description="Useful for questions related to specific facts",
            ),
        ),
        QueryEngineTool(
            query_engine=summary_query_engine,
            metadata=ToolMetadata(
                name=f"summary_tool_{file_base}",
                description="Useful for summarization questions",
            ),
        ),
    ]
    
    function_llm = OpenAI(model="mistralai/Mixtral-8x7B-Instruct-v0.1")
 
    agent = OpenAIAgent.from_llm(
        llm=function_llm,
        tools=query_tools,
        verbose=True,
        # system_prompt=dedent(f"""\
        #     You are a specialized agent designed to answer queries about the `{file_base}.html` part of the LlamaIndex docs.
        #     You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\
        #     """),
    )
    
    return agent, summary

In [17]:
async def build_agents(docs):
    node_parser = SentenceSplitter()

    agents_dict = {}
    extra_info_dict = {}

    for doc in tqdm(docs):
        nodes = node_parser.get_nodes_from_documents([doc])
        # key is "<parent directory>_<file stem>", e.g. "contact_index"
        file_path = Path(doc.metadata["path"])
        file_base = str(file_path.parent.stem) + "_" + str(file_path.stem)
        agent, summary = await build_agent_per_doc(nodes, file_base)

        agents_dict[file_base] = agent
        extra_info_dict[file_base] = {"summary": summary, "nodes": nodes}

    return agents_dict, extra_info_dict

In [18]:
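# NOTE: `docs` (one Document per page, each with a "path" entry in its metadata)
# is not defined in the cells above
agents_dict, extra_info_dict = await build_agents(docs)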

  0%|          | 0/32 [00:00<?, ?it/s]

100%|██████████| 32/32 [00:00<00:00, 200.78it/s]

feed_index
blogs_index
contact_index
our-story_index
about-us-afj_index
csr_index
feed_index
our-blog_index
feed_index
home-to-school_index
booking_index
patient-transport-service_index
admin_index
feed_index
fleet-maintenance_index
repatriation-services_index
private-hire-services_index
executive-minibus-hire_index
feedback_index
ambulance-event-cover_index
our-vision_index
news_index
feed_index
job-openings_index
our-mission_index
community_index
services_index
fully-loaded-cars_index
refundable-deposit_index
afj-training-school_index
well-trained-chauffeurs_index
extra-safety-via-gps_index





In [19]:
all_tools = []

for file_base, agent in agents_dict.items():
    summary = extra_info_dict[file_base]["summary"]
    doc_tool = QueryEngineTool(
        query_engine=agent, 
        metadata=ToolMetadata(
            name=f"tool_{file_base}",
            description=summary,
        )
    )
    all_tools.append(doc_tool)

In [20]:
from llama_index.objects import ObjectIndex, SimpleToolNodeMapping, ObjectRetriever
from llama_index.retrievers import BaseRetriever
from llama_index.postprocessor import SentenceTransformerRerank
from llama_index.tools import QueryPlanTool
from llama_index.query_engine import SubQuestionQueryEngine

In [21]:
llm = OpenAI(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

In [22]:
tool_mapping = SimpleToolNodeMapping.from_objects(all_tools)
obj_index = ObjectIndex.from_objects(
    all_tools, 
    tool_mapping,
    VectorStoreIndex
)

vector_node_retreiver = obj_index.as_node_retriever(similarity_top_k=10)

In [26]:
# define a custom retriever with reranking
class CustomRetriever(BaseRetriever):
    def __init__(self, vector_retriever, postprocessor=None):
        self._vector_retriever = vector_retriever
        self._postprocessor = postprocessor or SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3)
        super().__init__()

    def _retrieve(self, query_bundle):
        retrieved_nodes = self._vector_retriever.retrieve(query_bundle)
        filtered_nodes = self._postprocessor.postprocess_nodes(
            retrieved_nodes, query_bundle=query_bundle
        )

        return filtered_nodes


# define a custom object retriever that adds in a query planning tool
class CustomObjectRetriever(ObjectRetriever):
    def __init__(self, retriever, object_node_mapping, all_tools, llm=None):
        self._retriever = retriever
        self._object_node_mapping = object_node_mapping
        self._llm = llm or OpenAI("gpt-4-0613")

    def retrieve(self, query_bundle):
        nodes = self._retriever.retrieve(query_bundle)
        tools = [self._object_node_mapping.from_node(n.node) for n in nodes]

        sub_question_engine = SubQuestionQueryEngine.from_defaults(
            query_engine_tools=tools, service_context=service_context
        )
        sub_question_description = (
            "Useful for any queries that involve comparing multiple documents. "
            "ALWAYS use this tool for comparison queries - make sure to call this "
            "tool with the original query. Do NOT use the other tools for any "
            "queries involving multiple documents."
        )
        sub_question_tool = QueryEngineTool(
            query_engine=sub_question_engine,
            metadata=ToolMetadata(
                name="compare_tool", description=sub_question_description
            ),
        )

        return tools + [sub_question_tool]
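
The two-stage pattern in `CustomRetriever` (a broad first-pass retrieval followed by a stricter rerank) can be sketched without any library. Everything in the block below is illustrative: `retrieve_then_rerank`, `shared_words` and `exact_phrase` are hypothetical names, and both scorers are toy stand-ins. In the notebook the first pass is embedding similarity with `similarity_top_k=10` and the second pass is the `cross-encoder/ms-marco-MiniLM-L-2-v2` model with `top_n=3`.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    first_pass: Scorer,
    reranker: Scorer,
    top_k: int = 10,
    top_n: int = 3,
) -> list[str]:
    """Shortlist top_k candidates with a cheap scorer, then keep the top_n by a stricter one."""
    shortlist = sorted(candidates, key=lambda c: first_pass(query, c), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda c: reranker(query, c), reverse=True)[:top_n]


def shared_words(query: str, text: str) -> float:
    """Toy first-pass score: number of words the query and text have in common."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def exact_phrase(query: str, text: str) -> float:
    """Toy rerank score: 1.0 if the query appears verbatim in the text, else 0.0."""
    return 1.0 if query.lower() in text.lower() else 0.0


# illustrative candidates, not actual tool descriptions
candidates = [
    "minibus hire with driver",
    "hire a driver for your minibus",
    "school transport",
]
result = retrieve_then_rerank(
    "minibus hire", candidates, shared_words, exact_phrase, top_k=2, top_n=1
)
print(result)
```

The first pass cannot tell the two minibus entries apart, which is the situation a reranker is meant to resolve.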
    

In [27]:
custom_node_retriever = CustomRetriever(vector_node_retreiver)

# wrap it with ObjectRetriever to return objects
custom_obj_retriever = CustomObjectRetriever(
    custom_node_retriever, tool_mapping, all_tools, llm=llm
)

In [28]:
tmps = custom_obj_retriever.retrieve("hello")
print(len(tmps))

4


In [29]:
from llama_index.agent import FnRetrieverOpenAIAgent

top_agent = FnRetrieverOpenAIAgent.from_retriever(
    custom_obj_retriever,
    system_prompt=""" \
You are an agent designed to answer queries about the documentation.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\

""",
    llm=llm,
    verbose=True,
)

In [53]:
len(extra_info_dict)

28
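
The dictionary holds 28 entries rather than 32 because `file_base` is built only from the parent directory name and the file stem. `feed_index` appears five times in the build log above, so later pages with that key overwrite earlier ones, and 32 - 4 = 28. A minimal sketch of the collision and one possible workaround follows. The paths are illustrative and only assumed to mirror the `wget` directory layout, and `unique_key` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def notebook_key(path: str) -> str:
    """Key as built in build_agents: parent directory name plus file stem."""
    p = Path(path)
    return f"{p.parent.stem}_{p.stem}"


def unique_key(path: str, root: str) -> str:
    """Hypothetical alternative: join every path component below the site root."""
    rel = Path(path).relative_to(root).with_suffix("")
    return "_".join(rel.parts)


# illustrative paths, assumed to mirror the wget directory layout
root = "afjltd.co.uk"
paths = [
    "afjltd.co.uk/feed/index.html",
    "afjltd.co.uk/blogs/feed/index.html",
    "afjltd.co.uk/contact/index.html",
]

print(len({notebook_key(p) for p in paths}))      # 2: both feed pages share "feed_index"
print(len({unique_key(p, root) for p in paths}))  # 3: one key per page
```

Because the persisted index directory and the pickled summary are also named after `file_base`, colliding pages would reuse each other's cached files as well.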

In [41]:
all_nodes = [
    n for extra_info in extra_info_dict.values() for n in extra_info["nodes"]
]

In [55]:
len(all_nodes)

28

In [42]:
base_index = VectorStoreIndex(all_nodes, service_context=service_context)
base_query_engine = base_index.as_query_engine(similarity_top_k=4, llm=llm, service_context=service_context)

In [43]:
from pprint import pprint

In [46]:
response = base_query_engine.query("what service does afj limited provide?")

In [47]:
pprint(response.response)

('Based on the provided context, AFJ Limited provides services that can be '
 'explored through the following pages: job openings, our story, news, and our '
 'mission. While the specifics of these services are not detailed, one can '
 "infer that they may include job opportunities, learning about the company's "
 "background, staying updated with news, and understanding the company's "
 'mission. To get more detailed information, visiting these pages would be '
 'necessary.')


In [52]:
response = base_query_engine.query("does AFJ limited provide training services?")
pprint(response.response)

('Based on the information provided, there is no mention of AFJ Limited '
 'offering training services. The context includes pages about feedback, '
 'refundable deposit, booking, and the community, but none of them indicate '
 'the provision of training services.')


In [51]:
response = top_agent.query("what services does the company provide?")

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: tool_about-us-afj_index with args: {"input":"services"}
Added user message to memory: services
=== Calling Function ===
Calling function: vector_tool_about-us-afj_index with args: {"input":"services"}
Got output: I cannot directly refer to the context information, but based on the webpage's path provided, it seems that the company offers various services related to the "about-us-afj" section of their website. To get detailed information about the services they provide, you can visit the website and explore the relevant sections.

=== Calling Function ===
Calling function: vector_tool_about-us-afj_index with args: {"input":"services"}
Got output: I cannot directly refer to the context information, but based on the webpage's path provided, it seems that the company offers various services related to the "about-us-afj" section of their website. To get detailed information about the services they provide, you can v

KeyboardInterrupt: 

In [49]:
pprint(response.response)

('It seems like you are trying to call a function from the toolbox, but you '
 "haven't provided the necessary arguments. In your code, you need to pass the "
 '`ai_message` object with the `additional_kwargs` field containing the '
 '`tool_calls` list.\n'
 '\n'
 "Here's an example of how you can modify your code:\n"
 '\n'
 '```python\n'
 'ai_message = {\n'
 '    "role": "user",\n'
 '    "content": "I need to compare the features of two smartphones, one from '
 'Apple and one from Samsung."\n'
 '}\n'
 '\n'
 'tool_calls = [\n'
 '    {\n'
 '        "name": "compare_tool",\n'
 '        "arguments": {\n'
 '            "input": ai_message["content"]\n'
 '        }\n'
 '    }\n'
 ']\n'
 '\n'
 'ai_message["additional_kwargs"] = {\n'
 '    "tool_calls": tool_calls\n'
 '}\n'
 '\n'
 '# Now you can call the function\n'
 'function_name = ai_message["function_name"]\n'
 'function_arguments = ai_message["arguments"]\n'
 '\n'
 'result = functions[function_name](**function_arguments)\n'
 'print(result