# Custom Knowledge Chatbot w/ LlamaIndex


Examples:
- https://gita.kishans.in/
- https://www.chatpdf.com/

In [1]:
!pip install llama_index
!pip install langchain

Collecting llama_index
  Downloading llama_index-0.5.12.tar.gz (174 kB)
     ------------------------------------- 174.8/174.8 kB 10.3 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting dataclasses_json
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting langchain>=0.0.123
  Downloading langchain-0.0.137-py3-none-any.whl (518 kB)
     ------------------------------------- 518.3/518.3 kB 16.4 MB/s eta 0:00:00
Collecting tiktoken
  Downloading tiktoken-0.3.3-cp39-cp39-win_amd64.whl (579 kB)
     ------------------------------------- 579.8/579.8 kB 18.4 MB/s eta 0:00:00
Collecting pydantic<2,>=1
  Downloading pydantic-1.10.7-cp39-cp39-win_amd64.whl (2.2 MB)
     ---------------------------------------- 2.2/2.2 MB 46.2 MB/s eta 0:00:00
Collecting aiohttp<4.0.0,>=3.8.3
  Downloading aiohttp-3.8.4-cp39-cp39-win_amd64.whl (323 kB)
     ------------------------------------- 323.6/323.6 kB 19.6 M

# Basic LlamaIndex Usage Pattern

In [31]:
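import os
from getpass import getpass

# Never hard-code an API key in a notebook; prompt for it if it is not already set
openai_key = os.environ.get('OPENAI_API_KEY') or getpass('OpenAI API key: ')
os.environ['OPENAI_API_KEY'] = openai_key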

In [13]:
# Load your data into 'Documents', a custom type defined by LlamaIndex

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader('./data').load_data()

In [14]:
# Create an index of your documents

from llama_index import GPTSimpleVectorIndex

index = GPTSimpleVectorIndex.from_documents(documents)



INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 1321 tokens


In [33]:
# Save your index to an index.json file
index.save_to_disk('index.json')
# Load the index from your saved index.json file
index = GPTSimpleVectorIndex.load_from_disk('index.json')

In [15]:
# Query your index!

response = index.query("What do you think of Facebook's LLaMa?")
print(response)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1448 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 11 tokens



I think Facebook's LLaMa is a great step forward in democratizing access to large language models and advancing research in this subfield of AI. It is encouraging to see that they are making the model available at several sizes and providing a model card to detail how it was built in accordance with responsible AI practices. I am also glad to see that they are releasing the model under a noncommercial license to ensure integrity and prevent misuse.
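
The build-then-query pattern above can be pictured with a toy model. The sketch below is not LlamaIndex's implementation: it assumes a bag-of-words count vector as a stand-in for the real embedding model, and `ToyVectorIndex`, `embed`, and `cosine` are hypothetical names made up for illustration. It shows only the retrieval half: each chunk is embedded once at build time, and a question is answered by finding the chunk whose vector is closest to the question's vector.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (an assumed stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Hypothetical miniature of a vector index: stores (chunk, vector) pairs."""

    def __init__(self, chunks):
        # Embedding happens once, when the index is built
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, question, top_k=1):
        # Embed the question, then rank the stored chunks by similarity to it
        q = embed(question)
        ranked = sorted(self.entries, key=lambda entry: cosine(q, entry[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


chunks = [
    "LLaMA is released in several sizes under a noncommercial license.",
    "ASOS Premier costs 200 AED in the United Arab Emirates.",
]
toy_index = ToyVectorIndex(chunks)
best = toy_index.retrieve("Which license is LLaMA released under?")
print(best[0])
```

In the real library the retrieved chunks are then placed into a prompt and sent to the LLM, which is why the query log above reports both embedding tokens and LLM tokens.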


# Customize your LLM for different output

In [16]:
import openai
from langchain import OpenAI

In [32]:
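# Set up your LLM

from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper, ServiceContext


# define LLM ("Ada" is not a valid OpenAI model name; "text-ada-001" is the Ada completion model)
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.1, model_name="text-ada-001"))

# define prompt helper
# set maximum input size
max_input_size = 4096
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

# In llama_index 0.5.x, passing llm_predictor straight to from_documents() raises
# "TypeError: __init__() got an unexpected keyword argument 'llm_predictor'";
# bundle it in a ServiceContext and pass that instead.
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    # prompt_helper=prompt_helper
)

custom_LLM_index = GPTSimpleVectorIndex.from_documents(
    documents,
    service_context=service_context,
)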

In [25]:
# Query your index!

response = custom_LLM_index.query("challenges of LLaMa?")
print(response)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 1404 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 8 tokens



The challenges of LLaMa include bias, toxicity, and the potential for generating misinformation. Additionally, there is still more research that needs to be done to address the risks of bias, toxic comments, and hallucinations in large language models.
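
The three prompt helper settings above (maximum input size, number of output tokens, chunk overlap) interact as a token budget. The sketch below illustrates that arithmetic under stated assumptions; it is not how `PromptHelper` is implemented. `chunk_budget` and `split_with_overlap` are hypothetical helpers, and the 100-token prompt overhead is a made-up figure.

```python
def chunk_budget(max_input_size, num_output, prompt_tokens, num_chunks=1):
    """Tokens left for each context chunk after reserving the prompt and the answer."""
    available = max_input_size - num_output - prompt_tokens
    if available <= 0:
        raise ValueError("prompt and output leave no room for context")
    return available // num_chunks


def split_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into chunks whose neighbours share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


# Values from the cell above, plus an assumed 100-token prompt template
budget = chunk_budget(max_input_size=4096, num_output=256, prompt_tokens=100)
print(budget)

# Integers stand in for tokens so the shared boundary tokens are easy to see
print(split_with_overlap(list(range(10)), chunk_size=4, overlap=1))
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of embedding a few tokens twice.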


# Wikipedia Example

In [162]:
from llama_index import download_loader

WikipediaReader = download_loader("WikipediaReader")

loader = WikipediaReader()
wikidocs = loader.load_data(pages=['Cyclone Freddy'])

# https://en.wikipedia.org/wiki/Cyclone_Freddy

In [163]:
wiki_index = GPTSimpleVectorIndex.from_documents(wikidocs)

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 4103 tokens


In [165]:
response = wiki_index.query("What is cyclone freddy?")
print(response)

INFO:root:> [query] Total LLM token usage: 3844 tokens
INFO:root:> [query] Total embedding token usage: 8 tokens






# Customer Support Example

In [26]:
documents = SimpleDirectoryReader('./asos').load_data()

In [27]:

custom_LLM_index = GPTSimpleVectorIndex.from_documents(
    documents, 
    #llm_predictor=llm_predictor,
     # prompt_helper=prompt_helper
)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens


In [151]:
index = GPTSimpleVectorIndex.from_documents(documents)
response = index.query("What premier service options do I have in the UAE?")
print(response)

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 12584 tokens


In [153]:
response = index.query("What premier service options do I have in the UAE?")
print(response)

INFO:root:> [query] Total LLM token usage: 1317 tokens
INFO:root:> [query] Total embedding token usage: 11 tokens



In the United Arab Emirates, you have the option of signing up for ASOS Premier, which gives you free Standard and Express delivery all year round when you spend over 150 AED. It costs 200 AED and is valid on the order you purchase it on.


# YouTube Video Example

In [154]:
YoutubeTranscriptReader = download_loader("YoutubeTranscriptReader")

loader = YoutubeTranscriptReader()
documents = loader.load_data(ytlinks=['https://www.youtube.com/watch?v=K7Kh9Ntd8VE&ab_channel=DaveNick'])

In [159]:
index = GPTSimpleVectorIndex.from_documents(documents)

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 18181 tokens


In [157]:
response = index.query("What are some YouTube automation mistakes to avoid?")
print(response)

INFO:root:> [query] Total LLM token usage: 4024 tokens
INFO:root:> [query] Total embedding token usage: 8 tokens




1. Re-uploading other people's content without permission.
2. Using copyrighted music.
3. Not understanding how the YouTube algorithm works.
4. Not researching the best niche for YouTube automation.
5. Not optimizing the About section with relevant keywords.
6. Not creating a logo and channel art that is professional and attractive.


# Chatbot Class - Just include your index

In [28]:
import openai
import json

class Chatbot:
    def __init__(self, api_key, index):
        self.index = index
        openai.api_key = api_key
        self.chat_history = []

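    def generate_response(self, user_input):
        # Transcript of the last few messages; built for context but not sent to the index,
        # which is queried with the user's message only
        prompt = "\n".join([f"{message['role']}: {message['content']}" for message in self.chat_history[-5:]])
        prompt += f"\nUser: {user_input}"
        # Query the index passed to the constructor, not the notebook-level `index` global
        response = self.index.query(user_input)

        # Record both sides of the exchange so the history can be saved later
        message = {"role": "assistant", "content": response.response}
        self.chat_history.append({"role": "user", "content": user_input})
        self.chat_history.append(message)
        return message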
    
    def load_chat_history(self, filename):
        try:
            with open(filename, 'r') as f:
                self.chat_history = json.load(f)
        except FileNotFoundError:
            pass

    def save_chat_history(self, filename):
        with open(filename, 'w') as f:
            json.dump(self.chat_history, f)
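
The class above formats the last five messages into a transcript but queries the index with the new message alone. One way to make follow-up questions history-aware is to send that transcript as the query. The sketch below assumes this is acceptable for the index in use; `build_contextual_query` is a hypothetical helper, and `StubIndex` is a made-up stand-in that only echoes its input so the example runs without an API key.

```python
from types import SimpleNamespace


class StubIndex:
    """Hypothetical stand-in for an index: echoes back the query it receives."""

    def query(self, text):
        # Mimic the `.response` attribute that the Chatbot class reads
        return SimpleNamespace(response=f"echo: {text}")


def build_contextual_query(chat_history, user_input, max_messages=5):
    """Join the most recent messages and the new input into one query string."""
    lines = [f"{m['role']}: {m['content']}" for m in chat_history[-max_messages:]]
    lines.append(f"user: {user_input}")
    return "\n".join(lines)


history = [
    {"role": "user", "content": "What is LLaMA?"},
    {"role": "assistant", "content": "A family of language models."},
]
query = build_contextual_query(history, "What are its challenges?")
answer = StubIndex().query(query).response
print(answer)
```

The trade-off is that a longer query costs more embedding tokens and can pull retrieval toward earlier topics, so the number of messages kept is worth tuning.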


In [None]:
documents = SimpleDirectoryReader('./data').load_data()
index = GPTSimpleVectorIndex.from_documents(documents)

In [30]:
# Swap out your index below for whatever knowledge base you want
bot = Chatbot(openai_key, index=index)
bot.load_chat_history("chat_history.json")

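while True:
    user_input = input("You: ").strip()
    if not user_input:
        # An empty message cannot be embedded (the API rejects [''] as input), so ask again
        continue
    if user_input.lower() in ["bye", "goodbye"]:
        print("Bot: Goodbye!")
        bot.save_chat_history("chat_history.json")
        break
    response = bot.generate_response(user_input)
    print(f"Bot: {response['content']}")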

INFO:openai:error_code=None error_message="[''] is not valid under any of the given schemas - 'input'" error_param=None error_type=invalid_request_error message='OpenAI API error received' stream_error=False

# Pandas Dataframe Agent

In [34]:
import numpy as np
import pandas as pd

In [35]:
data2 = pd.read_csv('https://raw.githubusercontent.com/XUAN-24601/Korea_Spatial/main/Data/od_index_transit_sample.csv')  # .sample(1000)

# Drop every row that contains a NaN or an infinite value
data = data2[~data2.isin([np.nan, np.inf, -np.inf]).any(axis=1)]


In [37]:
data.columns

Index(['Unnamed: 0', 'ADM_CD_O', 'ADM_CD_D', 'flux_2020', 'flux_2021',
       'flux_2022', 'flux_2023', 'flux_%20-21', 'flux_%20-22', 'flux_%20-23',
       'Resilience_%20-21', 'Resilience_%20-22', 'Resilience_%20-23',
       'Mean_time', 'Line_OD', 'Distance', 'mode', 'Densi_bus_count_O',
       'Densi_train_O', 'I-Consu&Serv_O', 'I-PubAdmin_O',
       'I-Health&SocialWork_O', 'I-Manuf&Stor_O', 'I-Fina&Tech_O',
       'University_O', 'Local leisure_O', 'Densi_Tourism_O', 'poi_count_O',
       'Entropy_O', 'Car_own_O', 'ReDensity_O', '%foreigner_O',
       '%female_employee_O', 'Densi_bus_count_D', 'Densi_train_D',
       'I-Consu&Serv_D', 'I-PubAdmin_D', 'I-Health&SocialWork_D',
       'I-Manuf&Stor_D', 'I-Fina&Tech_D', 'University_D', 'Local leisure_D',
       'Densi_Tourism_D', 'poi_count_D', 'Entropy_D', 'Car_own_D',
       'ReDensity_D', '%foreigner_D', '%female_employee_D'],
      dtype='object')
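
The row filter applied when loading the data is easier to follow on a tiny frame. The sketch below uses made-up values; only the two column names are borrowed from the dataset above. It marks any row holding a NaN or an infinite value and keeps the rest.

```python
import numpy as np
import pandas as pd

# Made-up sample values; rows 1 and 2 each contain one unusable entry
toy = pd.DataFrame({
    "flux_2023": [10.0, np.nan, 30.0, 40.0],
    "Car_own_O": [0.5, 0.2, np.inf, 0.1],
})

# True for every row with at least one NaN, +inf or -inf
bad_rows = toy.isin([np.nan, np.inf, -np.inf]).any(axis=1)

# Keep only the rows where every value is finite
clean = toy[~bad_rows]
print(clean)
```

Passing `axis=1` by keyword matters here: the positional form `.any(1)` is what triggers the deprecation warning in recent pandas versions.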

In [36]:
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI




> Entering new AgentExecutor chain...
Thought: I need to count the rows
Action: python_repl_ast
Action Input: len(df)
Observation: 21906
Thought: I now know the final answer
Final Answer: There are 21906 rows in the dataframe.

> Finished chain.


'There are 21906 rows in the dataframe.'

In [42]:
agent = create_pandas_dataframe_agent(OpenAI(temperature=0), data, verbose=True)
agent.run("what's the max ridership in 2023?")



> Entering new AgentExecutor chain...
Thought: I need to find the maximum value in the flux_2023 column
Action: python_repl_ast
Action Input: df['flux_2023'].max()
Observation: 8210.4
Thought: I now know the final answer
Final Answer: 8210.4

> Finished chain.


'8210.4'

In [41]:
agent = create_pandas_dataframe_agent(OpenAI(temperature=0), data, verbose=True)
agent.run("find the correlation between the ridership in 2023 and Car ownership at Origins (O)")



> Entering new AgentExecutor chain...
Thought: I need to find the correlation between two columns
Action: python_repl_ast
Action Input: df[['flux_2023', 'Car_own_O']].corr()
Observation:            flux_2023  Car_own_O
flux_2023   1.000000  -0.027715
Car_own_O  -0.027715   1.000000
Thought: I now know the final answer
Final Answer: The correlation between ridership in 2023 and Car ownership at Origins is -0.027715.

> Finished chain.


'The correlation between ridership in 2023 and Car ownership at Origins is -0.027715.'
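
To read the figure the agent reports, it helps to recall what `DataFrame.corr()` computes by default: the Pearson coefficient, which ranges from -1 (perfect inverse linear relationship) to 1 (perfect direct one). The sketch below is a plain-Python version of that formula on made-up numbers; `pearson` is a hypothetical helper, not part of pandas or LangChain.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two series around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up series: one rises with x, the other falls as x rises
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Against that scale, the reported -0.027715 sits very close to zero, which indicates almost no linear relationship between 2023 ridership and car ownership at the origin.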