## DATA 255 : HOMEWORK 12 - LangChain

### Part A: Build a code understanding model. Upload your own custom code files to the model and ask questions based on the code file as context.

In [1]:
import openai
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
import os
from langchain.prompts import ChatPromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain.chains import RetrievalQA

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

**Setting up the OpenAI API key**

In [2]:
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

if openai_api_key is None:
    raise ValueError("API key is not set. Please set the OPENAI_API_KEY environment variable.")

**Initializing the LangChain ChatOpenAI model**

In [3]:
chat_model = ChatOpenAI(openai_api_key=openai_api_key, temperature=0)

**Defining a prompt template for understanding code**

In [4]:
template_string = """
You are an expert software engineer. Analyze the following code and answer the question provided.

Code: {text}

Question: {question}

Provide a detailed yet concise response.
"""
prompt_template = ChatPromptTemplate.from_template(template_string)
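
To make the role of the two placeholders concrete, here is a minimal sketch that fills the same template text with plain `str.format`. It is only a stand-in for `ChatPromptTemplate`, and `fill_template` is a hypothetical helper that does not appear in the notebook.

```python
# Stand-in for ChatPromptTemplate: plain str.format on the same placeholders.
TEMPLATE = """
You are an expert software engineer. Analyze the following code and answer the question provided.

Code: {text}

Question: {question}

Provide a detailed yet concise response.
"""


def fill_template(code: str, question: str) -> str:
    """Return the prompt text with both placeholders substituted (hypothetical helper)."""
    return TEMPLATE.format(text=code, question=question)


prompt = fill_template("def add(a, b):\n    return a + b", "What does add return?")
print(prompt)
```

Only the template itself is parsed for `{...}` fields, so braces inside the substituted code are passed through unchanged.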

**Reading the custom code file**

In [5]:
def read_code_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()

**Function to add a file to the vectorstore**

In [6]:
def add_code_to_vectorstore(file_path):
    code_content = read_code_file(file_path)
    # add_texts expects `metadatas`: a list with one dict per text
    vectorstore.add_texts([code_content], metadatas=[{"file_path": file_path}])

**Getting response from the chat model**

In [7]:
def generate_code_query(chat_model, prompt_template, code_content, question):
    messages = prompt_template.format_messages(text=code_content, question=question)
    response = chat_model.invoke(messages)
    return response.content

**Initializing memory for storing conversation and creating a chain with memory**

In [8]:
memory = ConversationBufferMemory()

conversation = ConversationChain(
    llm=chat_model,
    memory=memory,
    verbose=True
)



In [9]:
response = conversation.predict(input="What does this function do?")
print(response)

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

H: What does this function do?
AI:

> Finished chain.
This function is designed to calculate the square root of a given number. It takes the input number, checks if it is a positive number, and then uses a mathematical algorithm to find the square root. The result is then returned as the output of the function.
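
The logged prompt above shows how the chain lays out its input: a "Current conversation:" section holding the stored history, followed by the new human turn. In this first call the history is empty and no code is included, so the model had no function to inspect. The sketch below imitates that layout in plain Python to show how earlier turns would appear in later prompts. `SimpleBufferMemory` and `build_prompt` are hypothetical names, not LangChain classes, and the sample AI reply is invented for illustration.

```python
class SimpleBufferMemory:
    """Hypothetical stand-in for a conversation buffer: stores turns in order."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, str]] = []

    def save(self, human: str, ai: str) -> None:
        # Record one exchange as two labelled turns.
        self.turns.append(("Human", human))
        self.turns.append(("AI", ai))

    def render(self) -> str:
        # One "Role: text" line per stored turn; empty string when nothing is stored.
        return "\n".join(f"{role}: {text}" for role, text in self.turns)


def build_prompt(memory: SimpleBufferMemory, user_input: str) -> str:
    """Prepend the stored history to the new input, mirroring the logged layout."""
    return (
        "Current conversation:\n"
        f"{memory.render()}\n"
        f"Human: {user_input}\n"
        "AI:"
    )


memory_sketch = SimpleBufferMemory()
first_prompt = build_prompt(memory_sketch, "What does this function do?")

# After the first exchange is saved, the next prompt carries it along.
memory_sketch.save("What does this function do?", "I need to see the code first.")
second_prompt = build_prompt(memory_sketch, "Here it is: def f(x): return x * 2")
print(second_prompt)
```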


**Workflow for analyzing the code and generating responses**

In [10]:
def analyze_code_with_langchain(file_path, question):
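    # The number of results is configured on the retriever, e.g.
    # vectorstore.as_retriever(search_kwargs={"k": 2}), not via an `n_results`
    # argument to invoke(). Note that `file_path` is currently unused here.
    relevant_docs = retriever.invoke(question)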
    code_content = "\n".join([doc.page_content for doc in relevant_docs])
    response = generate_code_query(chat_model, prompt_template, code_content, question)
    return response
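
The function above follows a retrieve-then-prompt pattern: find the stored text most similar to the question, join it into one context string, and pass that to the model. The following self-contained sketch shows the same flow with a toy word-count similarity in place of OpenAI embeddings and Chroma. `bag_of_words`, `cosine`, `retrieve`, and the sample chunks are illustrative assumptions only.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (NOT an OpenAI embedding)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Made-up snippets standing in for stored code chunks.
chunks = [
    "def main():\n    st.title('Dashboard')",
    "def clean(posts):\n    return [p.strip() for p in posts]",
    "import functions as func",
]

top = retrieve("What does the main function do?", chunks, k=2)
context = "\n".join(top)  # same join as in analyze_code_with_langchain
print(context)
```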

**Initializing the Chroma vectorstore and embeddings**

In [11]:
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(openai_api_key=openai_api_key), persist_directory="./chroma_db")

**Usage: Loaded the custom code file "RedditDashboard.py"**

In [12]:
code_files = ["RedditDashboard.py"] 

for file_path in code_files:
    add_code_to_vectorstore(file_path)

retriever = vectorstore.as_retriever()

**Asking the model questions about the code**

In [13]:
file_path = "RedditDashboard.py"
question = "What is the name of the main function in the code and what does it do?"
response = analyze_code_with_langchain(file_path, question)
print(response)

The name of the main function in the code is "main()". 

The main function sets up the Social Media Listening Dashboard by configuring the page layout to a wider format, importing a CSS file for styling, and creating the main components of the dashboard such as the title, sidebar, keyword selection field, and search button. 

It also handles the search functionality where users can enter a keyword, search for it, and view the results in different tabs including Word Analysis, Sentiment Analysis, Sentiment Trends, Market Funnel, and Comment Analysis. 

Additionally, the main function displays metrics such as reach, engagement, and share of voice related to the searched keyword, and provides a brief overview of social media listening. It utilizes various functions from an external file called "functions.py" to perform data fetching, data cleaning, metric calculation, and visualization tasks.


In [14]:
question = "Does the dashboard page it sets up have any tabs, if yes, then what are the names of the tabs?"
response = analyze_code_with_langchain(file_path, question)

In [15]:
print(response)

Yes, the dashboard page set up by the main function in the provided code has tabs. The names of the tabs are:
1. Word Analysis
2. Sentiment Analysis
3. Sentiment Trends
4. Market Funnel
5. Comment Analysis

These tabs are used to display different types of analysis and visualizations related to the social media listening data fetched based on the keyword entered by the user. Each tab contains specific analysis or visualization components related to the keyword search.


In [16]:
question = "Is it importing any other non standard libraries, if yes, what functions is it calling from that?"
response = analyze_code_with_langchain(file_path, question)
print(response)

The code provided is importing a custom module named "functions" using the statement "import functions as func". This custom module contains various functions related to data processing and visualization for the Social Media Listening Dashboard. The code is calling functions from this custom module such as:
- start_data_fetch(keyword): This function is used to fetch data related to the specified keyword from social media sources.
- dataCleaning(posts): This function is used to clean and preprocess the fetched data before analysis.
- get_metrics(posts, keyword): This function calculates metrics like reach, engagement, and share of voice based on the processed data.
- generate_word_histogram(posts): This function generates a histogram of word frequencies in the posts.
- create_word_cloud(posts): This function creates a word cloud visualization based on the posts.
- create_sentiment_plot(posts): This function creates a plot showing the sentiment distribution of the posts.
- plot_aspect_se

**The model answered the questions about the custom code accurately and with relevant detail**

### Part B: Write a chatbot prompt to iteratively create a sequence of chats on one particular custom data.
1. The chatbot should be able to answer the questions based on the text data or multiple documents.
2. The chatbot should save the conversation in the memory.
3. Summarize the chats at the end of the conversation.

In [17]:
import openai
from dotenv import load_dotenv
import os
# ChatOpenAI and OpenAIEmbeddings live in langchain_openai (as in Part A);
# FAISS and PyPDFLoader live in langchain_community.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import ConversationChain
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

**Load and Split Documents into chunks**

In [18]:
documents = []
folder_path = "./"

for file_name in os.listdir(folder_path):
    if file_name.endswith(".pdf"):
        file_path = os.path.join(folder_path, file_name)
        loader = PyPDFLoader(file_path)
        raw_documents = loader.load()
        
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,  
            chunk_overlap=100  
        )
        documents.extend(text_splitter.split_documents(raw_documents))
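
To show what `chunk_size=1000` and `chunk_overlap=100` mean in practice, here is a simplified sketch of overlapping chunking in plain Python. It uses a fixed-width sliding window. The real `RecursiveCharacterTextSplitter` instead prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will differ. `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Fixed-width sliding window; a simplification of separator-aware splitting."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
pieces = chunk_with_overlap(sample)
print([len(p) for p in pieces])
```

Each chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.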

**Printing the metadata of the documents to check that they are split correctly**

In [19]:
for i, doc in enumerate(documents):
    print(f"Document {i+1}:")
    print("Metadata:", doc.metadata)
    print("Content Preview:", doc.page_content[:200], "...")  
    print("="*50)

Document 1:
Metadata: {'source': './Artificial_General_Intelligence_Concept_State_of_t.pdf', 'page': 0}
Content Preview: Journal of Artiﬁcial General Intelligence 5(1) 1-46, 2014 Submitted 2013-2-12
DOI: 10.2478/jagi-2014-0001 Accepted 2014-3-15
Artiﬁcial General Intelligence:
Concept, State of the Art, and Future Prosp ...
Document 2:
Metadata: {'source': './Artificial_General_Intelligence_Concept_State_of_t.pdf', 'page': 0}
Content Preview: biology inspired perspectives. The spectrum of designs for AGI systems includes systems with
symbolic, emergentist, hybrid and universalist characteristics. Metrics for general intelligence are
evalua ...
Document 3:
Metadata: {'source': './Artificial_General_Intelligence_Concept_State_of_t.pdf', 'page': 0}
Content Preview: of the pursuit of discrete capabilities or speciﬁc practical tasks. But while this approach has yielded
many interesting technologies and theoretical results, it has proved relatively unsuccessful in  ...
Document 4:
Metadata: 

**Building the FAISS vectorstore with OpenAI embeddings**

In [20]:
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)



**Defining Prompt Template for RAG chatbot**

In [21]:
retrieval_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are a helpful AI assistant. Use the information from the provided documents to answer questions accurately.

Context:
{context}

User's Question:
{question}

Answer:
"""
)

**Configuring Conversation Memory**

In [22]:
memory = ConversationBufferMemory(memory_key="history", return_messages=True)

**Building Retrieval QA Chain**

In [23]:
llm = ChatOpenAI(temperature=0, model_name="gpt-4") 
qa_chain = load_qa_chain(llm, chain_type="stuff", prompt=retrieval_prompt)

retrieval_chain = RetrievalQA(
    retriever=vectorstore.as_retriever(),
    combine_documents_chain=qa_chain,
    memory=memory
)
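
The `chain_type="stuff"` setting means every retrieved chunk is placed ("stuffed") into a single prompt. A rough sketch of that idea in plain Python follows, assuming the chunks are joined with a blank line. `stuff_documents` is a hypothetical helper and the sample chunk strings are placeholders. This is not the LangChain implementation.

```python
RETRIEVAL_TEMPLATE = """
You are a helpful AI assistant. Use the information from the provided documents to answer questions accurately.

Context:
{context}

User's Question:
{question}

Answer:
"""


def stuff_documents(chunks: list[str], question: str, separator: str = "\n\n") -> str:
    """Join every retrieved chunk into one context string and fill the prompt."""
    context = separator.join(chunks)
    return RETRIEVAL_TEMPLATE.format(context=context, question=question)


stuffed = stuff_documents(
    ["first retrieved chunk", "second retrieved chunk"],
    "What is AGI?",
)
print(stuffed)
```

Because everything goes into one prompt, this strategy suits a small number of short chunks. With many or long chunks the prompt can exceed the model's context window.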



**Interacting with the Chatbot**

In [24]:
print("Welcome to the Custom Data Chatbot! Ask any question.")
while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Exiting the chatbot.")
        break
        
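    # invoke() replaces the deprecated run(); RetrievalQA returns its answer under "result"
    response = retrieval_chain.invoke({"query": user_input})["result"]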
    print(f"AI: {response}")

Welcome to the Custom Data Chatbot! Ask any question.



You:  What is the document?




AI: The document appears to be an academic or scholarly article discussing aspects of Artificial General Intelligence (AGI). It includes references to various sources and discusses topics such as cognitive architecture, the measurement of AGI, and the variety of approaches to AGI. The document also mentions the level of agreement within the AGI community.



You:  What is Artificial General Intelligence?


AI: Artificial General Intelligence (AGI) refers to the creation and study of software or hardware systems with general intelligence comparable to, and ultimately perhaps greater than, that of human beings. A generally intelligent system should be good at generalizing the knowledge it’s gained, so as to transfer this knowledge from one problem or context to others. However, real-world general intelligences are inevitably somewhat biased toward certain sorts of goals and environments. The spectrum of designs for AGI systems includes systems with extensive, built-in capabilities, including the ability to improve through learning.



You:  What is the core of AGI hypothesis?


AI: The core AGI hypothesis is the belief that the creation and study of synthetic intelligences with sufficiently broad (e.g. human-level) scope and strong generalization capability, is fundamentally different from the creation and study of synthetic intelligences with significantly narrower scope and weaker generalization capability. This hypothesis is widely accepted within the AGI community.



You:  What are the competencies that scientist understand humans to display? 


AI: The competencies that scientists understand humans to display, as assembled at the 2009 AGI Roadmap Workshop, include:

1. Perception
   - Vision: image and scene analysis and understanding
   - Hearing: identifying the sounds associated with common objects; understanding which sounds come from which sources in a noisy environment
   - Touch: identifying common objects and carrying out common actions using touch alone
   - Crossmodal: Integrating information from various senses
2. Social construction: assembling new social groups, modifying existing ones.



You:  What is ACT-R?


AI: The document does not provide specific information on what ACT-R is.



You:  What is Emergentist AGI Approaches?


AI: Emergentist AGI (Artificial General Intelligence) approaches expect every aspect of intelligence, including abstract symbolic processing, to emerge from lower-level "subsymbolic" dynamics. These dynamics are sometimes designed to simulate neural networks or other aspects of human brain function. Emergentist architectures are often strong at recognizing patterns in high-dimensional data, reinforcement learning, and associative memory. However, it has not yet been convincingly demonstrated how to achieve high-level functions such as abstract reasoning or complex language processing using a purely subsymbolic, emergentist approach. Another potential emergentist approach to AGI is to simulate a different type of biology, such as the evolving ecosystem that gave rise to the brain in the first place.



You:  exit


Exiting the chatbot.


**Creating a prompt to summarize the conversation**

In [25]:
summary_prompt = PromptTemplate(
    input_variables=["context"],
    template="""
You are a helpful AI assistant. Below is the conversation history. Summarize it concisely while preserving the main points discussed.

Conversation History:
{context}

Summary:
"""
)

In [26]:
conversation_history = "\n".join([message.content for message in memory.chat_memory.messages])
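
Joining only the message contents drops the information about who said what. A possible refinement, sketched here with a stand-in `Message` class, prefixes each line with its speaker. LangChain message objects carry a comparable `type` field, but that should be verified against the installed version. `Message` and `to_transcript` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for a chat message with a speaker type and text."""
    type: str      # "human" or "ai"
    content: str


def to_transcript(messages: list[Message]) -> str:
    """Render messages as 'Speaker: text' lines, one per message."""
    labels = {"human": "Human", "ai": "AI"}
    return "\n".join(f"{labels.get(m.type, m.type)}: {m.content}" for m in messages)


history = [
    Message("human", "What is ACT-R?"),
    Message("ai", "The document does not provide specific information on what ACT-R is."),
]
transcript = to_transcript(history)
print(transcript)
```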

In [27]:
from langchain.schema import Document
input_documents = [Document(page_content=conversation_history)]

summary_chain = load_qa_chain(llm, chain_type="stuff", prompt=summary_prompt)
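# invoke() replaces the deprecated run(); the "stuff" chain returns its text under "output_text"
summary_response = summary_chain.invoke({"input_documents": input_documents})["output_text"]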

print("\nConversation Summary:")
print(summary_response)


Conversation Summary:
The conversation revolves around the topic of Artificial General Intelligence (AGI), an academic field focused on creating software or hardware systems with intelligence comparable to or greater than humans. The document discussed is an academic article that covers various aspects of AGI, including cognitive architecture, measurement, and different approaches. The core AGI hypothesis, widely accepted in the AGI community, posits that creating synthetic intelligences with broad scope and strong generalization capability is fundamentally different from creating ones with narrower scope and weaker generalization. Human competencies, as identified at the 2009 AGI Roadmap Workshop, include perception and social construction. The conversation also touched on Emergentist AGI approaches, which expect all aspects of intelligence to emerge from lower-level "subsymbolic" dynamics, often designed to simulate neural networks or other aspects of human brain function. However, 

**Deleting the vectorstore to save memory**

In [28]:
del vectorstore

## Thank You!