# LangChain Demo with Telegram Data

This notebook demonstrates how to use LangChain together with Chroma (chromadb).

Messages extracted from a Telegram channel are embedded with OpenAI embeddings. The embeddings are saved to a local Chroma database and then used for semantic search and retrieval-augmented generation (RAG).

In [177]:
import os
import pandas as pd
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv()) # read local .env file

from langchain.llms import OpenAIChat
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.document_loaders import DataFrameLoader
from langchain.vectorstores import Chroma
from chromadb.config import Settings

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from IPython.display import display, Markdown
from telethon.sync import TelegramClient
from telethon.tl.types import InputMessagesFilterEmpty

## Data loading and preprocessing

Load telegram data for a specific channel.

In [142]:
# Telegram credentials are read from the local .env file
api_id = os.environ["TELEGRAM_API_ID"]
api_hash = os.environ["TELEGRAM_API_HASH"]
phone = os.environ["TELEGRAM_PHONE"]
username = os.environ["TELEGRAM_USERNAME"]

channel_id = "singularitynet"

pd_data = []

columns = ["channel_name", "id", "peer_id", "date", "message", "out", "mentioned",
        "media_unread", "silent", "post", "from_scheduled", "legacy", 
        "edit_hide", "pinned","noforwards", "from_id", "fwd_from", "via_bot_id",
        "reply_to", "media", "reply_markup", "entities", "views",
        "forwards", "replies", "edit_date", "post_author", "grouped_id",
        "reactions", "restriction_reason", "ttl_period"]

client = TelegramClient(f"../sessions_data/{phone}", api_id, api_hash)
n = 20000  # maximum number of messages to fetch

async with client:
    async for msg in client.iter_messages(channel_id, filter=InputMessagesFilterEmpty, limit=n):
        try:
            # from_id can be None (e.g. anonymous or channel posts), so read the
            # sender id defensively instead of failing on a missing attribute.
            sender_id = getattr(msg.from_id, "user_id", None) or getattr(msg.from_id, "channel_id", None)
            pd_data.append((channel_id, msg.id, msg.peer_id, msg.date, msg.message,
                    msg.out, msg.mentioned, msg.media_unread, msg.silent, msg.post,
                    msg.from_scheduled, msg.legacy, msg.edit_hide, msg.pinned, msg.noforwards,
                    sender_id, msg.fwd_from, msg.via_bot_id, msg.reply_to, msg.media, msg.reply_markup,
                    msg.entities, msg.views, msg.forwards, msg.replies, msg.edit_date, msg.post_author,
                    msg.grouped_id, msg.reactions, msg.restriction_reason, msg.ttl_period
            ))
        except Exception as e:
            # Skip the problematic message but keep iterating over the rest
            print(f"skipping message {msg.id}: {e}")
            continue

df = pd.DataFrame(pd_data, columns=columns)
df = df[df['message'] != ''] # remove empty messages
df = df[~df["message"].isna()] # remove nan text
df = df.sort_values(by="date", ascending=False)
df = df.set_index(["channel_name", "id"])

In [143]:
df.head(2)

channel_name,id,peer_id,date,message,out,mentioned,media_unread,silent,post,from_scheduled,legacy,...,entities,views,forwards,replies,edit_date,post_author,grouped_id,reactions,restriction_reason,ttl_period
singularitynet,1007934,PeerChannel(channel_id=1140090881),2024-04-29 08:20:04+00:00,"‍Welcome Do,\nJoin us on the journey towards A...",False,False,False,True,False,False,False,...,"[MessageEntityTextUrl(offset=0, length=1, url=...",,,"MessageReplies(replies=0, replies_pts=1423094,...",NaT,,,,,
singularitynet,1007933,PeerChannel(channel_id=1140090881),2024-04-29 07:55:54+00:00,"No, that wont be possible",False,False,False,False,False,False,False,...,,,,,NaT,,,,,


In [144]:
print("Total messages:", len(df))
max_date = df['date'].max()
min_date = df['date'].min()
print("Maximum Date:", max_date)
print("Minimum Date:", min_date)

Total messages: 19559
Maximum Date: 2024-04-29 08:20:04+00:00
Minimum Date: 2024-03-02 20:12:11+00:00


Next, for each message in the dataset, build a conversation history by chaining the message with the earlier messages it replies to.

Whole conversations, rather than single messages, may give the LLM better context when answering questions about the project.

In [146]:
import re
import ast
import numpy as np

def extract_reply_id(val):
    """Return the reply_to_msg_id parsed from a stringified reply header, or None."""
    if isinstance(val, str):
        match = re.search(r'reply_to_msg_id=(\d+)', val)
        if match:
            return int(match.group(1))
    return None

def compute_message_historical(df):
    # compute message history for each message if available

    df_extended = df.copy()
    df_extended["in_history"] = False
    df_extended["reply_to_msg_id"] = df_extended["reply_to"].apply(lambda x: int(x.reply_to_msg_id) if x is not None else None)

    # Only use if Class was parsed from text
    # df['reply_to_msg_id'] = df['reply_to'].apply(extract_reply_id)
    
    for (channel_name, message_id), row in df_extended.iterrows():
        history = [f"User_{row['from_id']}: {row['message']}"] # Initialize historical with current message
        reply_id = row["reply_to_msg_id"]
        
        try:
            # pd.notna handles both None and NaN reply ids
            while pd.notna(reply_id) and (channel_name, reply_id) in df_extended.index:
                # Get reply row
                reply_row = df_extended.loc[(channel_name, reply_id)]

                # Update history
                history.append(f"User_{reply_row['from_id']}: {reply_row['message']}")

                # # Delete message appended to history
                # df_extended = df_extended.drop((channel_name, reply_id))
                df_extended.loc[(channel_name, reply_id), "included"] = True 

                # assign next reply id
                reply_id = reply_row["reply_to_msg_id"]


        except Exception as e:
            print(type(reply_id))
            print("something failed", e)

        df_extended.loc[(channel_name, message_id), "history"] = str(history[::-1])

        # Ignore already iterated replies

    df_extended["history"] = df_extended["history"].apply(ast.literal_eval)
    df_extended["history_str"] = df_extended["history"].apply(lambda x: "- " + "\n- ".join(x))
    df_extended["thread_length"] = df_extended["history"].str.len()
    return df_extended

df_plus = compute_message_historical(df)
df_plus = df_plus[df_plus["included"].isna()]
df_plus.to_csv(f"data/{channel_id}_replies.csv", index=False)
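To make the reply-chaining idea above easier to follow, here is a minimal standard-library sketch on toy data. The message ids, senders, and texts are made up, and the helpers `build_thread` and `leaf_threads` are illustrative names, not part of the notebook. Like the cell above, it walks each reply chain back to its root and keeps only the threads whose last message was never replied to.

```python
# Toy messages keyed by id; "reply_to" is the id of the parent message (or None).
# All ids, senders and texts are invented for illustration.
messages = {
    1: {"sender": "A", "text": "Where can I stake?", "reply_to": None},
    2: {"sender": "B", "text": "Use the staking portal", "reply_to": 1},
    3: {"sender": "A", "text": "Thanks!", "reply_to": 2},
    4: {"sender": "C", "text": "gm", "reply_to": None},
}


def build_thread(messages, message_id):
    """Walk the reply chain upwards and return it oldest-first."""
    history = []
    seen = set()  # guards against accidental reply cycles
    current = message_id
    while current is not None and current in messages and current not in seen:
        seen.add(current)
        msg = messages[current]
        history.append(f"User_{msg['sender']}: {msg['text']}")
        current = msg["reply_to"]
    return history[::-1]


def leaf_threads(messages):
    """Keep only threads whose last message was never replied to."""
    replied_to = {m["reply_to"] for m in messages.values() if m["reply_to"] is not None}
    return {mid: build_thread(messages, mid) for mid in messages if mid not in replied_to}


threads = leaf_threads(messages)
for mid, thread in threads.items():
    print(mid)
    print("- " + "\n- ".join(thread))
```

Messages 1 and 2 are absorbed into the thread that ends at message 3, which mirrors how rows already included in a longer history are filtered out before the CSV is written.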

In [157]:
df = pd.read_csv(f"data/{channel_id}_replies.csv")
              
# Use langchain wrapper to load dataframe              
df_loader = DataFrameLoader(df, page_content_column="history_str")
docs = df_loader.load()
df["history_str"].iloc[0:4]

0    - User_210944655: ‍Welcome Do,\nJoin us on the...
1    - User_578573938: waiting to see if agix stake...
3    - User_6273725083: Hi Guys! Can I please ask s...
Name: history_str, dtype: object

In [158]:
df.shape

(11632, 35)

## Local chroma db for text embeddings

In [159]:
import chromadb

persistent_client = chromadb.PersistentClient()

CLEAR_COLLECTION = True

if CLEAR_COLLECTION:
    try:
        persistent_client.delete_collection(f"openai_embeddings_{channel_id}")
        print("deleted collection for", channel_id)
    except Exception as e:
        print("unable to delete ", e)

# How to create a client with reset allowed
# client = chromadb.HttpClient(settings=Settings(allow_reset=True))
# client.reset()  # resets the database

unable to delete  Collection openai_embeddings_singularitynet does not exist.


With the collection of Telegram messages loaded, it is time to create embeddings using OpenAI. Chroma offers a simple way to do this.

Beware! This code sends several requests to the OpenAI API in order to create an embedding for each message thread in `docs`. Depending on the size of your dataset, this can incur high costs.

In [160]:
openai_embeddings_function = OpenAIEmbeddings()

openai_chroma_client = Chroma.from_documents(
    docs, 
    openai_embeddings_function, 
    client=persistent_client, 
    collection_name = f"openai_embeddings_{channel_id}"
)
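Before running an embedding step like the one above on a large dataset, a rough size estimate can help. The sketch below is a back-of-the-envelope calculation only: the 4-characters-per-token ratio is a common rule of thumb for English text rather than an exact tokenizer count, and `price_per_1k_tokens` is a hypothetical placeholder that must be replaced with the current price of the embedding model in use. The function name `estimate_embedding_cost` is also illustrative.

```python
def estimate_embedding_cost(texts, price_per_1k_tokens, chars_per_token=4.0):
    """Back-of-the-envelope token and cost estimate for embedding `texts`.

    chars_per_token is a rough heuristic, not a real tokenizer count.
    """
    total_chars = sum(len(t) for t in texts)
    est_tokens = total_chars / chars_per_token
    est_cost = est_tokens / 1000 * price_per_1k_tokens
    return est_tokens, est_cost


# Made-up sample threads; in the notebook this could be
# [d.page_content for d in docs] instead.
sample_texts = ["- User_1: How to stake?", "- User_2: Where can I buy this token"]

# 0.0001 is a hypothetical price, used only to exercise the function.
tokens, cost = estimate_embedding_cost(sample_texts, price_per_1k_tokens=0.0001)
print(f"~{tokens:.0f} tokens, ~${cost:.6f}")
```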

## Semantic search

For simple semantic search, a user query is embedded and compared against the items stored in the local database. The most similar items are returned as the most relevant results for the search.

In [171]:
# helper function to perform similarity search
def search_db(db_client, query: str, top_k = 100):
    docs = db_client.similarity_search(query, k=top_k)

    # print results
    for i, item in enumerate(docs[:5]):
        print("\nThread", i)
        print(f"\n{item.page_content}")
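What a similarity search does under the hood can be sketched with hand-made vectors. This is an illustration only: the three-dimensional "embeddings" are invented, and cosine similarity is used here as one common choice, whereas the metric Chroma actually applies depends on how the collection is configured. The helper names `cosine_similarity` and `top_k` are assumptions for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made 3-d "embeddings"; real embeddings have far more dimensions.
store = {
    "Where can I buy this token": [0.9, 0.1, 0.0],
    "How do I stake": [0.1, 0.9, 0.0],
    "Good morning everyone": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the query "buy token"

results = top_k(query_vec, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The query vector points mostly in the same direction as the "buy" message, so that message ranks first, followed by the staking message.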

The embeddings are saved locally. Now we can perform search queries by obtaining the most similar documents to our query in order of relevance:

In [173]:
query = "buy token"
search_db(openai_chroma_client, query)


Thread 0

- User_6962443130: Where i can buy this token
- User_6859650518: you can buy on the site

Thread 1

- User_445927377: When we can buy token.

Thread 2

- User_1836130981: We want to do a deal you have your token fet have their token you need a way to trade

Thread 3

- User_1174250068: Where do you have to exchange the tokens?

Thread 4

- User_6962443130: Where i can buy this token
- User_6859650518: pm i will put you through with screenshot


## Retrieval augmented generation

For retrieval-augmented generation we use LangChain. We pass it the Chroma collection holding the embeddings together with a user query. Using an OpenAI chat model in the background, it answers the query based on the most relevant results retrieved.

In [179]:
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(), 
    chain_type="stuff", 
    retriever=openai_chroma_client.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 30}), 
    return_source_documents=True
)
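The retriever above uses `search_type="mmr"` (maximal marginal relevance), which first fetches `fetch_k` candidates and then picks a smaller set that balances relevance to the query against redundancy among the picks. The following is a minimal sketch of that greedy selection on invented vectors. It is not LangChain's implementation, and the `lambda_mult` weighting and the helper names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance over (text, vector) candidates."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(item):
            relevance = cosine(query_vec, item[1])
            # Similarity to the closest already-selected item (0 if none yet)
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]


# Invented candidates: the first two are near-duplicates of each other.
candidates = [
    ("Where can I buy this token", [1.0, 0.0, 0.0]),
    ("Where can I buy this token?", [1.0, 0.05, 0.0]),
    ("Which exchange lists it", [0.6, 0.0, 0.8]),
]
query_vec = [1.0, 0.0, 0.4]

picked = mmr(query_vec, candidates, k=2)
print(picked)
```

A plain top-2 similarity search would return both near-duplicate questions, while the MMR selection swaps the second one for the less redundant message. This is why duplicated threads, such as those seen in the search output earlier, are less likely to crowd the context passed to the LLM.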

In [175]:
query = "What are some use cases of singularity net?"
response = qa({"query":query})

In [180]:
response

{'query': 'What are some use cases of singularity net?',
 'result': 'SingularityNet aims to provide a decentralized marketplace for AI services and algorithms. Some potential use cases include healthcare diagnostics, financial analysis, autonomous vehicles, personalized recommendations, and more. The platform enables developers to create and monetize AI services, opening up possibilities for various industries to leverage AI technology.',
 'source_documents': [Document(page_content='- User_1922071887: SingularityNet brings a fleet of partner projects and close tie-ins that are being architected to work seamlessly together and create synergies', metadata={'date': '2024-03-29 18:58:25+00:00', 'edit_date': '2024-03-29 18:59:12+00:00', 'edit_hide': True, 'from_id': 1922071887, 'from_scheduled': False, 'history': "['User_1922071887: SingularityNet brings a fleet of partner projects and close tie-ins that are being architected to work seamlessly together and create synergies']", 'in_history'

In [185]:
import textwrap

# Wrap the text to a specific width
wrapped_text = textwrap.fill(response["result"], width=100)

# Print the wrapped text
print(wrapped_text)


SingularityNet aims to provide a decentralized marketplace for AI services and algorithms. Some
potential use cases include healthcare diagnostics, financial analysis, autonomous vehicles,
personalized recommendations, and more. The platform enables developers to create and monetize AI
services, opening up possibilities for various industries to leverage AI technology.


Note that the previous chain was not given a custom prompt instructing it to answer only from the provided context. As a result, the LLM appears to be answering from its general knowledge of the world rather than from the retrieved messages.

### Reusing persisted chroma collection

To rerun the previous code without computing the embeddings again, we can load the Chroma collection that was previously saved locally.

In [186]:
# check if the existing collection has documents
collection_name = f"openai_embeddings_{channel_id}"

persistent_client = chromadb.PersistentClient()
collection = persistent_client.get_collection(collection_name)
collection.count()

11632

In [188]:
# load embeddings from disk

openai_embeddings_function = OpenAIEmbeddings()

openai_client = Chroma(
    persist_directory="./chroma", 
    collection_name=collection_name, 
    embedding_function=openai_embeddings_function
)

Check that the document embeddings were loaded properly by running a simple similarity search.

In [190]:
query = "FUD"
search_db(openai_client, query)  # query the client reloaded from disk


Thread 0

- User_339022103: FUD?

Thread 1

- User_339022103: Your FUD is also rushed
- User_551595722: No fud. A pretty direct recommendation to vote against
- User_6093679747: This failing is the biggest fud imaginable. FUD hopes to wreck bags. A no vote is guaranteed to.

Thread 2

- User_417019065: More fudders than ppl in this room
- User_1084633296: Definitions are dying in here 😂now: FUD = doesn't agree with what I think
- User_1990833629: Soon we will call it "treason" and send people to a digital gulag xD

Thread 3

- User_417019065: More fudders than ppl in this room
- User_1084633296: Definitions are dying in here 😂now: FUD = doesn't agree with what I think
- User_417019065: Honestly a lot of ppl doesnt know what they are talking seems like they are sheeps lol
- User_1990833629: "Sheep" goes both ways
- User_417019065: Keep crying
- User_1990833629: Why do you come here just to troll?

Thread 4

- User_1922071887: all fair questions
- User_5977890525: If you ask these quest

### Retrieval with context

In this example, we provide an additional prompt to the RetrievalQA Chain. In this prompt, it is indicated that the user query should be answered based only on the given context. The chain is also configured to output every step of its process.

In [191]:
# Prompt the chat model to answer the user query given a context.

template= """
The context given are several messages retrieved from a chat about a blockchain project. Based on those messages try to answer the user question.
----------------
context: {context}
user question: {question}
"""

In [192]:
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(), 
    chain_type="stuff", 
    retriever=openai_client.as_retriever(search_type="mmr", search_kwargs={'k': 20}),
    return_source_documents=True,
    verbose=True,
    chain_type_kwargs={
        "verbose":True,
        "prompt": PromptTemplate(
            template=template,
            input_variables=["context", "question"],
        ),
    }
)

Let's inspect the prompt structure:

In [193]:
print(qa.combine_documents_chain.llm_chain.prompt)

input_variables=['context', 'question'] template='\nThe context given are several messages retrieved from a chat about a blockchain project. Based on those messages try to answer the user question.\n----------------\ncontext: {context}\nuser question: {question}\n'
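As a rough illustration of what a "stuff"-style chain does with this template, the sketch below joins the text of the retrieved documents into a single context string and substitutes it, together with the question, into the prompt. The retrieved snippets are made up, the blank-line separator is an assumption based on the formatted prompts shown in the outputs, and `stuff_prompt` is a hypothetical helper rather than a LangChain function.

```python
template = """
The context given are several messages retrieved from a chat about a blockchain project. Based on those messages try to answer the user question.
----------------
context: {context}
user question: {question}
"""

# Made-up stand-ins for the page_content of the retrieved documents.
retrieved = [
    "- User_1: How to stake?\n- User_2: !staking",
    "- User_3: How to unstake?",
]


def stuff_prompt(template, documents, question, separator="\n\n"):
    """Join all documents into one context string and fill the template."""
    context = separator.join(documents)
    return template.format(context=context, question=question)


prompt = stuff_prompt(template, retrieved, "How to do staking?")
print(prompt)
```

Because every retrieved thread is placed into one prompt, the number of documents returned by the retriever directly affects prompt length and therefore the cost of each query.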


Now it is time to test the chain with real queries:

In [194]:
query = "What are some negative aspects about this project?"
response = qa({"query":query})



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
The context given are several messages retrieved from a chat about a blockchain project. Based on those messages try to answer the user question.
----------------
context: - User_5583737902: Another reason why presenting the case as a done deal before a vote is bad. It puts too much pressure on the project if a NO is passed.

- User_6165637792: What a way to ruin such a great project

- User_1086637470: Why I shall vote NO:

1. Rushed vote with WP non-present with enough time before vote to have a healthy debate. Remember: business intelligence tactics include things like this. Quick poorly informed false dicotomies presented as opportunities with time-pressure inciting to FOMO. Classic marketing /sales tactics.
2. The feeling that this prssure comes from an urge imposed by FETCH AI plans.
3. Badly negotiat

In [65]:
print(response["result"])

Some negative aspects about this project include:
- Lack of tangible results and financial returns for investors
- Concerns about the community potentially ruining the project
- Potential for a toxic community trying to take control in the name of decentralization
- Uncertainty about the effectiveness of the project and its various side projects
- Skepticism about the negotiation process and the overall deal
- Disappointment in the lack of adoption and use case across all three projects
- Disillusionment with the project's progress and outcomes, leading to doubts about its long-term viability and success


In [201]:
query = "How to do staking?"
response = qa({"query":query})



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
The context given are several messages retrieved from a chat about a blockchain project. Based on those messages try to answer the user question.
----------------
context: - User_6480686176: How to stake?
- User_1723326597: !staking

- User_510661770: How to unstake?

- User_7131149338: How can I stake my agix

- User_1912557943: Hello, I have some questions about staking. Who can assist? :)

- User_6989780283: /staking

- User_6989780283: /staking

- User_378034663: Does anyone know the staking rewards?

- User_1518954403: Hey guys, is staking worth it ?

- User_327618551: How much fees for staking

- User_6281353071: How can i stake agix?
- User_1723326597: !staking

- User_327618551: /staking

- User_6546249709: Where do I stake the token to participate in the launch pad

- User_2007302667: !staking

- U

In [202]:
wrapped_text = textwrap.fill(response["result"], width=100)
print(wrapped_text)

Based on the chat messages, it seems that the command "!staking" is commonly used to inquire about
staking in the blockchain project. So, to stake in this project, you can try using the command
"!staking" or "/staking" to get more information or assistance on how to stake your tokens.
Additionally, you can also ask for help or clarification from other users who may have experience
with staking in the project.


In [204]:
query = "How to contact support?"
response = qa({"query":query})



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
The context given are several messages retrieved from a chat about a blockchain project. Based on those messages try to answer the user question.
----------------
context: - User_7099962622: Any support team here I need help
- User_503303035: What do you need?

- User_1684147743: Is there a support contact foe staking

- User_5684934395: Thank you
- User_1723326597: Please use the contact info above and staking support will assist when they are back online on Monday. Include relevant info in your email.

- User_6273563012: please, how do i buy?

- User_6037030851: I've an issue and need assistance.
- User_503303035: Feel free to ask.

- User_5830833969: I sent a request via "contact" section on the official website, but I didn't receive any response. Is there someone who can assist with official communicati

In [205]:
wrapped_text = textwrap.fill(response["result"], width=100)
print(wrapped_text)

To contact support for the blockchain project, you can use the contact information provided on the
official website. Staking support will assist when they are back online on Monday. You can also
reach out to team members directly via direct message (DM) on the chat platform. Additionally, you
can send an email to the support team for assistance.


In [206]:
query = "Information for developers"
response = qa({"query":query})



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
The context given are several messages retrieved from a chat about a blockchain project. Based on those messages try to answer the user question.
----------------
context: - User_7086847052: Does anyone look for a developer?

- User_7163446993: Is there anyone looking for a developer?

- User_7026075085: plz let me know if anyone looking for a dev

- User_454084975: Our website has more info

- User_6974589243: Does anyone need a developer?
- User_1723326597: !jobs

- User_6974589243: Does anyone look for a developer?
- User_1723326597: No, thanks

- User_1278149190: Does anyone need a d eveloper?
- User_1723326597: !jobs

- User_6537699716: Does anyone need D2veloper?

- User_527061814: For me a single token is a prerequisite to offering developers (external) a joined up workflow. The more they can do to i

In [207]:
wrapped_text = textwrap.fill(response["result"], width=100)
print(wrapped_text)

Based on the messages in the chat, it seems like there are developers actively looking for
opportunities and projects to work on. Some users have mentioned their experience and skills in web
development, blockchain technologies, and AI services. Additionally, there are references to
websites with more information about projects and opportunities for developers. If you are a
developer looking for information or opportunities, you may want to reach out to some of the users
in the chat who have expressed interest in working on projects or check out the websites mentioned
for more information.


## Conclusions

The output quality of both semantic search and retrieval-augmented generation depends directly on the quality of the source data. In this case, the Telegram channel does not always provide relevant information about the project, although it may suit narrow user queries such as finding the answer to a specific technical question. For general queries about the project's use cases, this data source is not suitable, and other data sources should be explored for better quality.