### Experimenting with Discord
In this document we explore how to manage conversation data from Discord using LlamaIndex.


In [1]:
import json
import sys

In [2]:
class Message:
    def __init__(self, message_id, message_text, author, timestamp, parent_message=None, child_message=None):
        self.message_id = message_id
        self.message_text = message_text
        self.author = author
        self.timestamp = timestamp
        self.parent_message = parent_message
        self.child_message = child_message
        
    def set_child(self, message):
        self.child_message = message
        
    def set_parent(self, message):
        self.parent_message = message
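
To illustrate how the parent/child links form a reply chain, here is a minimal, self-contained sketch. It uses a hypothetical `Msg` dataclass as a trimmed-down stand-in for `Message`, and the `link` and `walk` helpers as well as the sample messages are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Msg:
    """Hypothetical stand-in for Message: one node in a linked reply chain."""
    message_id: str
    message_text: str
    author: str
    # Links are excluded from repr/compare to avoid infinite recursion.
    parent_message: Optional["Msg"] = field(default=None, repr=False, compare=False)
    child_message: Optional["Msg"] = field(default=None, repr=False, compare=False)


def link(parent: Msg, child: Msg) -> None:
    """Connect a reply to the message it answers, in both directions."""
    child.parent_message = parent
    parent.child_message = child


def walk(root: Msg) -> List[str]:
    """Follow child links from a root message and collect the message ids."""
    ids = [root.message_id]
    current = root
    while current.child_message is not None:
        current = current.child_message
        ids.append(current.message_id)
    return ids


question = Msg("1", "How do I persist an index?", "alice")
answer = Msg("2", "Persist the storage context.", "bob")
thanks = Msg("3", "Thanks!", "alice")
link(question, answer)
link(answer, thanks)
chain = walk(question)
print(chain)
```

A message without a parent is the root of a conversation, and following `child_message` from it visits the replies in order.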

In [3]:
def readFile(filename):
    data = None
    with open(filename, "r") as f:
        data = json.load(f)
    return data

In [4]:
from typing import Dict

In [31]:
def writeToConvDocs(filename):
    data = readFile(filename)
    
    # Discord message ids are strings in the JSON dump
    messages: Dict[str, Message] = {}
    for message in data["messages"]:
        _id = message["id"]
        text = message["content"]
        message_type = message["type"]
        author = message["author"]["name"]
        timestamp = message["timestamp"]
        
        if message_type in ("ThreadCreated", "ChannelPinnedMessage"):
            continue
        
        messages[_id] = Message(_id, text, author, timestamp)
        if message_type == "Reply":
            parent_id = message["reference"]["messageId"]
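            try:
                messages[_id].set_parent(messages[parent_id])
            except KeyError:
                # The parent is not in this dump (or was skipped above),
                # so this reply stays unlinked and is treated as a root message
                continue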
            messages[parent_id].set_child(messages[_id])
            
    convo_docs = []
    for msg in messages.values():
        if not msg.parent_message:
            metadata = {
                "timestamp": msg.timestamp,
                "id": msg.message_id,
                "author": msg.author
            }
            convo = ""
            convo += msg.author + ":\n"
            convo += msg.message_text + "\n"
            
            curr_msg:Message = msg
            is_thread = False
            while curr_msg.child_message is not None:
                is_thread = True
                curr_msg = curr_msg.child_message
                convo += curr_msg.author + ":\n"
                convo += curr_msg.message_text + "\n"

            if is_thread:
                convo_docs.append({"thread": convo, "metadata": metadata})
                
    with open("converation_docs.json", "w") as f:
        json.dump(convo_docs, f)
    
    print("Written information to convo docs")
    return
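
The function above reads from and writes to disk, which makes it awkward to try on small inputs. The sketch below applies the same idea to an in-memory list: index messages by id, link each reply to its parent, then walk from each root that has at least one reply. The `build_threads` helper and the sample messages are hypothetical; they only mirror the fields used above (`id`, `type`, `content`, `author.name`, `reference.messageId`).

```python
from typing import Any, Dict, List, Optional


def build_threads(raw_messages: List[Dict[str, Any]]) -> List[str]:
    """Group replies under their root message and render each thread as text."""
    by_id: Dict[str, Dict[str, Any]] = {}
    child_of: Dict[str, str] = {}  # parent id -> id of its most recent reply
    has_parent = set()

    for raw in raw_messages:
        by_id[raw["id"]] = raw
        if raw["type"] == "Reply":
            parent_id = raw["reference"]["messageId"]
            if parent_id in by_id:  # ignore replies to messages we have not seen
                child_of[parent_id] = raw["id"]
                has_parent.add(raw["id"])

    threads = []
    for msg_id in by_id:
        if msg_id in has_parent or msg_id not in child_of:
            continue  # not a root, or a root without replies
        lines = []
        current: Optional[str] = msg_id
        while current is not None:
            m = by_id[current]
            lines.append(f'{m["author"]["name"]}:\n{m["content"]}\n')
            current = child_of.get(current)
        threads.append("".join(lines))
    return threads


sample = [
    {"id": "1", "type": "Default", "content": "Can I update an index?",
     "author": {"name": "alice"}},
    {"id": "2", "type": "Reply", "content": "Yes, try refresh.",
     "author": {"name": "bob"}, "reference": {"messageId": "1"}},
    {"id": "3", "type": "Default", "content": "Unrelated message",
     "author": {"name": "carol"}},
]
sample_threads = build_threads(sample)
print(sample_threads[0])
```

As in the function above, each message keeps a single child, so when a message receives several direct replies only the last one is followed.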
                
            

# Discussion
The point of this document is to explore how to handle data from ever-expanding data sources. As an example, we use successive dumps of a continuously updated Discord channel.
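
Because each dump contains the full channel history, successive dumps overlap heavily. A minimal sketch of finding what a later dump adds, assuming every message carries a unique `id` under a top-level `messages` key, could look like this. The `new_messages` helper and the sample dumps are hypothetical.

```python
from typing import Any, Dict, List


def new_messages(old_dump: Dict[str, Any], new_dump: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the messages that appear in new_dump but not in old_dump."""
    seen = {m["id"] for m in old_dump["messages"]}
    return [m for m in new_dump["messages"] if m["id"] not in seen]


old = {"messages": [{"id": "1", "content": "first"},
                    {"id": "2", "content": "second"}]}
new = {"messages": [{"id": "1", "content": "first"},
                    {"id": "2", "content": "second"},
                    {"id": "3", "content": "third"}]}
added = new_messages(old, new)
print([m["id"] for m in added])
```

This comparison only detects added messages. Detecting edits to existing messages would also require comparing their content.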

In [7]:
import os
print(os.listdir("data"))

['help_channel_dump_06_02_23.json', 'help_channel_dump_05_25_23.json']


In [15]:
from typing import Any

In [16]:
with open("./data/help_channel_dump_05_25_23.json", "r") as f:
    data:Dict[Any, Any] = json.load(f)

In [12]:
type(data)

dict

In [17]:
data.keys()

dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount'])

In [19]:
print("Number of messages:", len(data["messages"]))

Number of messages: 5087


In [20]:
print("Sample Message Keys: ", data["messages"][0].keys(), "\n")
print("First Message: ", data["messages"][0]["content"], "\n")
print("Last Message: ", data["messages"][-1]["content"])

Sample Message Keys:  dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions']) 

First Message:  If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! 
- If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues. 

Last Message:  Hello there! How can I use llama_index with GPU?


In [32]:
writeToConvDocs("./data/help_channel_dump_05_25_23.json")

Written information to convo docs


In [36]:
from typing import List

In [37]:
with open("converation_docs.json", "r") as f:
    threads:List[Dict[Any, Any]] = json.load(f)

In [38]:
threads[0].keys()

dict_keys(['thread', 'metadata'])

In [39]:
print(threads[0]["metadata"], "\n")
print(threads[0]["thread"], "\n")

{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566', 'author': 'arminta7'} 

arminta7:
Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made. 

So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬. 

Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓 

Thank you for making this sort of project accessible t

In [41]:
threads[0].keys()

dict_keys(['thread', 'metadata'])

In [42]:
threads[0]["metadata"].keys()

dict_keys(['timestamp', 'id', 'author'])

### Now it's time for the fun part: creating the index
We will use the extracted threads to build the index.

In [40]:
from llama_index.schema import Document

In [44]:
documents: List[Document] = []
for thread in threads:
    text = thread["thread"]
    _id = thread["metadata"]["id"]
    timestamp = thread["metadata"]["timestamp"]
    # Use the Discord message id as the document id so a later refresh can match documents
    documents.append(Document(id_=_id, text=text, metadata={"date": timestamp}))

In [45]:
from llama_index.indices.vector_store import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

In [46]:
print("ref_docs ingested: ", len(index.ref_doc_info))
print("number of input documents: ", len(documents))

ref_docs ingested:  767
number of input documents:  767


In [47]:
thread_id = threads[0]["metadata"]["id"]
print(index.ref_doc_info[thread_id])

RefDocInfo(node_ids=['ad0b9bf8-040e-405f-8e9e-e48d1101a9c4'], metadata={'date': '2023-01-02T03:36:04.191+00:00'})


In [48]:
# Let's persist the index so we do not have to re-index, which saves cost
index.storage_context.persist()

In [49]:
# Let's reload the index from the saved storage context
from llama_index.storage import StorageContext
from llama_index.indices import load_index_from_storage

In [50]:
storage_context = StorageContext.from_defaults(persist_dir="storage")

In [51]:
index = load_index_from_storage(storage_context)

#### Refreshing the index with new data
When the underlying data changes over time, we want to update the index incrementally instead of rebuilding it, because a full rebuild costs both time and money.
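
To make the idea concrete, here is a simplified, library-free sketch of what a refresh step has to decide for each incoming document: skip it if its text is unchanged, otherwise insert or update it. This is an illustration only, not LlamaIndex's implementation. The `refresh` function, the `store` dict, and the sample documents are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def text_hash(text: str) -> str:
    """Fingerprint a document's text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh(store: Dict[str, str], docs: List[Tuple[str, str]]) -> List[bool]:
    """Insert or update (doc_id, text) pairs.

    Returns one flag per input document: True if it was inserted or updated,
    False if the stored hash already matched.
    """
    flags = []
    for doc_id, text in docs:
        digest = text_hash(text)
        if store.get(doc_id) == digest:
            flags.append(False)
        else:
            store[doc_id] = digest
            flags.append(True)
    return flags


store: Dict[str, str] = {}
first = refresh(store, [("a", "hello"), ("b", "world")])
second = refresh(store, [("a", "hello"), ("b", "world, edited"), ("c", "new thread")])
print(first, second)
```

In this sketch, counting the `True` flags gives the number of inserted plus updated documents, so that count can exceed the number of brand-new ids.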

In [53]:
with open("./data/help_channel_dump_06_02_23.json", "r") as f:
    data = json.load(f)
print("JSON keys: ", data.keys(), "\n")
print("Message Count: ", len(data["messages"]), "\n")
print("Sample Message Keys: ", data["messages"][0].keys(), "\n")
print("First Message: ", data["messages"][0]["content"], "\n")
print("Last Message: ", data["messages"][-1]["content"])

JSON keys:  dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount']) 

Message Count:  5286 

Sample Message Keys:  dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions']) 

First Message:  If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! 
- If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues. 

Last Message:  Started a thread.


In [54]:
writeToConvDocs("./data/help_channel_dump_06_02_23.json")

Written information to convo docs


In [56]:
with open("converation_docs.json", "r") as f:
    threads = json.load(f)
print("Thread keys: ", threads[0].keys(), "\n")
print(threads[0]["metadata"], "\n")
print(threads[0]["thread"], "\n")

Thread keys:  dict_keys(['thread', 'metadata']) 

{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566', 'author': 'arminta7'} 

arminta7:
Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made. 

So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬. 

Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓 

Than

In [57]:
new_documents = []

for thread in threads:
    _id = thread["metadata"]["id"]
    text = thread["thread"]
    timestamp = thread["metadata"]["timestamp"]

    new_documents.append(Document(id_=_id, text=text, metadata={"date": timestamp}))

In [58]:
print("Number of new documents: ", len(new_documents) - len(documents))

Number of new documents:  13


In [60]:
refreshed_docs = index.refresh(
    new_documents,
)

In [61]:
# refresh() returns one boolean per input document: True if it was inserted or updated
print("Number of newly inserted/refreshed docs: ", sum(refreshed_docs))

Number of newly inserted/refreshed docs:  17


In [66]:
print("The total number of documents is ", len(refreshed_docs))

The total number of documents is  780


In [69]:
print(refreshed_docs[-14:])

[False, True, True, True, True, True, True, True, True, True, True, True, True, True]


In [71]:
from pprint import pprint

In [72]:
pprint(new_documents[-21])

Document(id_='1110938122902048809', embedding=None, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Siddhant Saurabh:\nhey facing error\n```\n*error_trace: Traceback (most recent call last):\n File "/app/src/chatbot/query_gpt.py", line 248, in get_answer\n   context_answer = self.call_pinecone_index(request)\n File "/app/src/chatbot/query_gpt.py", line 229, in call_pinecone_index\n   self.source.append(format_cited_source(source_node.doc_id))\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 172, in doc_id\n   return self.node.ref_doc_id\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 87, in ref_doc_id\n   return self.relationships.get(DocumentRelationship.SOURCE, None)\nAttributeError: \'Field\' object has no attribute \'get\'\n```\nwith latest llama_index 0.6.9\n@Logan M @jerryjliu98 @ravitheja\nLogan M:\nHow are you i

In [73]:
documents[-8]

Document(id_='1110938122902048809', embedding=None, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Siddhant Saurabh:\nhey facing error\n```\n*error_trace: Traceback (most recent call last):\n File "/app/src/chatbot/query_gpt.py", line 248, in get_answer\n   context_answer = self.call_pinecone_index(request)\n File "/app/src/chatbot/query_gpt.py", line 229, in call_pinecone_index\n   self.source.append(format_cited_source(source_node.doc_id))\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 172, in doc_id\n   return self.node.ref_doc_id\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 87, in ref_doc_id\n   return self.relationships.get(DocumentRelationship.SOURCE, None)\nAttributeError: \'Field\' object has no attribute \'get\'\n```\nwith latest llama_index 0.6.9\n@Logan M @jerryjliu98 @ravitheja\nLogan M:\nHow are you i