Part B: Write a chatbot prompt that iteratively builds a sequence of chats over one custom dataset.
1. The chatbot should be able to answer questions based on the text data or multiple documents.
2. The chatbot should save the conversation in memory.
3. Summarize the chats at the end of the conversation.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd '/content/drive/My Drive/Data 255 Spring 2024/Google Colab/Homework12'

/content/drive/My Drive/Data 255 Spring 2024/Google Colab/Homework12



Install necessary libraries



In [None]:
!pip install python-dotenv

In [None]:
!pip install -qU \
    langchain==0.0.354 \
    openai==1.6.1 \
    datasets==2.10.1 \
    pinecone-client==3.1.0 \
    tiktoken==0.5.2

In [None]:
# Load necessary libraries

import os
import openai

import langchain
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain

In [None]:
from dotenv import load_dotenv, find_dotenv
# read local .env file
_ = load_dotenv(find_dotenv())
openai.api_key = 'xxx'

In [None]:
os.environ["OPENAI_API_KEY"] = 'xxx'

In [None]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"
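
The date check above can also be wrapped in a small function, which makes it easy to try against fixed dates. This is only a minimal sketch: the name `pick_model` is a hypothetical helper and does not appear elsewhere in this notebook.

```python
import datetime

# Cutoff date taken from the cell above
TARGET_DATE = datetime.date(2024, 6, 12)

def pick_model(today: datetime.date) -> str:
    """Return the model name to use on the given date (hypothetical helper)."""
    if today > TARGET_DATE:
        return "gpt-3.5-turbo"
    return "gpt-3.5-turbo-0301"

# Same decision as the if/else above, evaluated for today's date
print(pick_model(datetime.date.today()))
```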

Polar sentiment dataset of sentences from financial news. The dataset consists of 4,840 sentences from English-language financial news, categorised by sentiment. It is divided into subsets by the agreement rate among 5-8 annotators.
Labels: 0 = negative; 1 = neutral; 2 = positive
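
The label encoding above can be kept in a small lookup table so that numeric labels are readable when printed. This is a sketch only; `LABEL_NAMES` and `label_name` are hypothetical names introduced for illustration.

```python
# Mapping from the integer labels described above to their names
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}

def label_name(label) -> str:
    """Translate a numeric label to its name (hypothetical helper).

    The label is cast to int first, so a float such as 1.0 is accepted too.
    """
    return LABEL_NAMES[int(label)]

print(label_name(1))
```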

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "financial_phrasebank",
    "sentences_allagree",
     split="train",
)

dataset



Dataset({
    features: ['sentence', 'label'],
    num_rows: 2264
})

In [None]:
dataset[0]

{'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
 'label': 1}

In [None]:
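# Load necessary libraries related to Pinecone vector DB

from pinecone import Pinecone

# read the Pinecone API key from the environment
# (the variable name PINECONE_API_KEY is an assumption; 'xxx' is a placeholder fallback)
api_key_pinecone = os.getenv("PINECONE_API_KEY", "xxx")

# configure client
pc = Pinecone(api_key=api_key_pinecone)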

In [None]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

In [None]:
import time

index_name = 'llama-2-rag'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

In [None]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

In [None]:
print(dataset.column_names)  # Prints the names of the columns in the dataset
print(dataset.features)      # Prints the features and their data types


['sentence', 'label']
{'sentence': Value(dtype='string', id=None), 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None)}


In [None]:
def print_batch(batch):
    print(batch['sentence'])
    print(batch['label'])

# Process and print in batches
dataset.map(print_batch, batched=True, batch_size=3)


In [None]:
data = dataset.to_pandas()  # this makes it easier to iterate over the dataset

In [None]:
len(data)

2264

In [None]:
data.head(3)

   sentence                                            label
0  According to Gran , the company has no plans t...      1
1  For the last quarter of 2010 , Componenta 's n...      2
2  In the third quarter of 2010 , net sales incre...      2


In [None]:
#!pip install unidecode

In [None]:
from tqdm.auto import tqdm
# To handle non-ASCII characters
import unidecode

def sanitize_id(text):
    """Convert text to ASCII, removing or replacing non-ASCII characters."""
    return unidecode.unidecode(text)

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i + batch_size)
    batch = data.iloc[i:i_end]

    # Generate sanitized ids from sentence and label
    # (identical rows share an id, so a later upsert overwrites the earlier one)
    ids = [sanitize_id(f"{x['sentence']}-{x['label']}") for _, x in batch.iterrows()]

    # Get text to embed
    texts = [x['sentence'] for _, x in batch.iterrows()]

    # Embed text
    embeds = embed_model.embed_documents(texts)

    # Get metadata to store in Pinecone
    metadata = [
        {'text': x['sentence'],
         'label': x['label']} for _, x in batch.iterrows()
    ]

    # Add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))


  0%|          | 0/23 [00:00<?, ?it/s]
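
The batching in the loop above can be checked on its own, without calling the embedding model or Pinecone. The sketch below reproduces only the index arithmetic; `batch_ranges` is a hypothetical helper, and the row count and batch size are the ones used in this notebook.

```python
def batch_ranges(n_rows: int, batch_size: int):
    """Yield (start, end) index pairs covering n_rows in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)

ranges = list(batch_ranges(2264, 100))
print(len(ranges))   # 23 batches, matching the progress bar total
print(ranges[-1])    # (2200, 2264): the last batch is shorter
```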

In [None]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 2259}},
 'total_vector_count': 2259}
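
The index reports 2259 vectors although the DataFrame has 2264 rows. One plausible explanation (an assumption, not verified against the dataset here) is that the ids are built from the sentence and label, so repeated rows map to the same id and overwrite each other on upsert. The toy rows below are made up for illustration and show the effect, along with a position-based id scheme that cannot collide.

```python
# Made-up rows for illustration; the first and third are exact duplicates
rows = [
    ("Sales rose in the third quarter .", 2),
    ("The company had no comment .", 1),
    ("Sales rose in the third quarter .", 2),
]

# ids built the same way as in the upsert loop: "<sentence>-<label>"
content_ids = [f"{sentence}-{label}" for sentence, label in rows]

# alternative: ids built from the row position, one distinct id per row
row_ids = [f"row-{i}" for i, _ in enumerate(rows)]

print(len(set(content_ids)))  # 2: the duplicate rows collapse into one id
print(len(set(row_ids)))      # 3: every row keeps its own id
```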

In [None]:
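# import under an alias so the Pinecone client class imported earlier is not shadowed
from langchain.vectorstores import Pinecone as PineconeVectorStore
# the metadata field that contains our text
text_field = "text"

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index, embed_model.embed_query, text_field
)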



## 1. The chatbot should be able to answer the questions based on the text data or multiple documents.

In [None]:
query = "What is the most popular company mentioned?"

vectorstore.similarity_search(query, k=3)

[Document(page_content="The company 's market share is continued to increase further .", metadata={'label': 2.0}),
 Document(page_content='The report profiles 614 companies including many key and niche players worldwide such as Black & Decker Corporation , Fiskars Corporation , Fiskars Brands , Inc. , Husqvarna Outdoor Products Inc. , K+S Group , Ryobi Technologies , Inc. , The Scotts Miracle-Gro Company , and Van Group , Inc. .', metadata={'label': 1.0}),
 Document(page_content="S Group 's loyal customer magazine Yhteishyv+ñ came second with 1,629,000 readers and Sanoma Corporation 's daily newspaper Helsingin Sanomat was third with 1,097,000 readers .", metadata={'label': 1.0})]

In [None]:
query = "What is the sentiment mostly?"

vectorstore.similarity_search(query, k=3)

[Document(page_content='What we think ?', metadata={'label': 1.0}),
 Document(page_content='Curators have divided their material into eight themes .', metadata={'label': 1.0}),
 Document(page_content='`` The trend in the sports and leisure markets was favorable in the first months of the year .', metadata={'label': 2.0})]

In [None]:
query = "Which year is the data collected for?"

vectorstore.similarity_search(query, k=3)

[Document(page_content='The studies are expected to start in 2008 .', metadata={'label': 1.0}),
 Document(page_content='Also , a six-year historic analysis is provided for this market .', metadata={'label': 1.0}),
 Document(page_content='Market data and analytics are derived from primary and secondary research .', metadata={'label': 1.0})]

In [None]:
query = "Are the stock markets up or down?"

vectorstore.similarity_search(query, k=3)

[Document(page_content="Markets had been expecting a poor performance , and the company 's stock was up 6 percent at  x20ac 23.89 US$ 33.84 in early afternoon trading in Helsinki .", metadata={'label': 2.0}),
 Document(page_content="The broad-based WIG index ended Thursday 's session 0.1 pct up at 65,003.34 pts , while the blue-chip WIG20 was 1.13 down at 3,687.15 pts .", metadata={'label': 1.0}),
 Document(page_content='LONDON MarketWatch -- Share prices ended lower in London Monday as a rebound in bank stocks failed to offset broader weakness for the FTSE 100 .', metadata={'label': 0.0})]

In [None]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

In [None]:
print(augment_prompt(query))

Using the contexts below, answer the query.

    Contexts:
    Markets had been expecting a poor performance , and the company 's stock was up 6 percent at  x20ac 23.89 US$ 33.84 in early afternoon trading in Helsinki .
The broad-based WIG index ended Thursday 's session 0.1 pct up at 65,003.34 pts , while the blue-chip WIG20 was 1.13 down at 3,687.15 pts .
LONDON MarketWatch -- Share prices ended lower in London Monday as a rebound in bank stocks failed to offset broader weakness for the FTSE 100 .

    Query: Are the stock markets up or down?
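
In the printed prompt above, the "Contexts:" and "Query:" lines are indented because the indentation inside the triple-quoted string is kept. The sketch below shows one way to build the same prompt without that indentation. It is not the notebook's `augment_prompt`: `build_prompt` and `fake_retrieve` are hypothetical names, and the retriever is a stub that stands in for `vectorstore.similarity_search` so the example runs offline.

```python
from typing import Callable, List

def build_prompt(query: str, retrieve: Callable[[str, int], List[str]], k: int = 3) -> str:
    """Join the retrieved contexts and the query into a single prompt string."""
    contexts = "\n".join(retrieve(query, k))
    return "\n".join([
        "Using the contexts below, answer the query.",
        "",
        "Contexts:",
        contexts,
        "",
        f"Query: {query}",
    ])

def fake_retrieve(query: str, k: int) -> List[str]:
    """Stub retriever returning fixed sentences (stands in for the vector store)."""
    docs = [
        "Share prices ended lower in London Monday .",
        "The WIG index ended the session 0.1 pct up .",
    ]
    return docs[:k]

print(build_prompt("Are the stock markets up or down?", fake_retrieve))
```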


In [None]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

In [None]:
chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)

In [None]:
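# conversation history; the cell that originally defined `messages` is not shown,
# so a generic system message is assumed here
messages = [
    SystemMessage(content="You are a helpful assistant.")
]

# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)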

res = chat(messages)

print(res.content)

Based on the provided contexts, the stock markets mentioned are experiencing mixed results. The WIG index ended Thursday's session slightly up by 0.1%, while the blue-chip WIG20 was down by 1.13%. In London, share prices ended lower on Monday, despite a rebound in bank stocks. So, the stock markets mentioned are a mix of up and down movements.


In [None]:
prompt = HumanMessage(
    content="is it a good time to invest?"
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided contexts, it appears that the markets have been expecting a poor performance. However, the company's stock was up by 6 percent at a particular point in time. Additionally, the WIG index ended slightly up, while the blue-chip WIG20 was down. In London, share prices ended lower despite a rebound in bank stocks. Considering these mixed signals and the uncertainty in the market, it may not be a clear-cut good time to invest. It is advisable to conduct further research and consult with financial advisors before making investment decisions.


In [None]:
prompt = HumanMessage(
    content=augment_prompt(
        "is it a good time to invest?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided contexts, it seems that investors are still interested in the company's shares, indicating potential future growth. However, since the investments are not disclosed and the company is currently evaluating the financial feasibility of a project, it may be advisable to gather more information before making an investment decision. It could be a good time to consider investing once more details about the project and investments become available.


## 2. The chatbot should save the conversation in the memory.

In [None]:
# Save to memory

llm = ChatOpenAI(openai_api_key=openai.api_key, temperature=0.0)
memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory = memory,
    verbose=True
)

In [None]:
memory.load_memory_variables({})

{'history': "Human: Based on the provided contexts, it seems that investors are still interested in the company's shares, indicating potential future growth. However, since the investments are not disclosed and the company is currently evaluating the financial feasibility of a project, it may be advisable to gather more information before making an investment decision. It could be a good time to consider investing once more details about the project and investments become available.\nAI: Yes, that's a very astute observation. Investors are indeed showing interest in the company's shares, which could be a positive sign for potential future growth. The fact that the investments are not disclosed yet does add a level of uncertainty, but it's always wise to gather as much information as possible before making any investment decisions. Once the company evaluates the financial feasibility of the project and more details are revealed, it could provide a clearer picture for potential investors

In [None]:
conversation.predict(input="Is it a good time to invest?")



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Based on the provided contexts, it seems that investors are still interested in the company's shares, indicating potential future growth. However, since the investments are not disclosed and the company is currently evaluating the financial feasibility of a project, it may be advisable to gather more information before making an investment decision. It could be a good time to consider investing once more details about the project and investments become available.
AI: Yes, that's a very astute observation. Investors are indeed showing interest in the company's shares, which could be a positive sign for potential future growth. The fact that t

"Based on the current information available, it may be a good time to consider investing once more details about the project and investments are disclosed. It's important to gather as much information as possible and assess the potential risks and rewards before making any investment decisions. Keep monitoring the company's developments and financial feasibility evaluations to make an informed choice."
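
The verbose output above shows the earlier turns being prepended to each new prompt under "Current conversation:". The sketch below imitates that behaviour in plain Python to make the mechanism explicit. It is a toy stand-in, not LangChain's `ConversationBufferMemory`; the class name `SimpleBufferMemory` and the example turns are assumptions made for illustration.

```python
class SimpleBufferMemory:
    """Toy conversation buffer: stores turns and renders them as text."""

    def __init__(self):
        self.turns = []  # list of (human_text, ai_text) pairs

    def save_context(self, human: str, ai: str) -> None:
        """Append one completed exchange to the buffer."""
        self.turns.append((human, ai))

    def history(self) -> str:
        """Render all stored turns in the 'Human: ... / AI: ...' layout."""
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)

memory_sketch = SimpleBufferMemory()
memory_sketch.save_context("Are the stock markets up or down?", "The signals are mixed.")
memory_sketch.save_context("Is it a good time to invest?", "It is not clear-cut.")

# The full history is what would be placed in front of the next user input
print(memory_sketch.history())
```

Because the whole buffer is sent with every request, a final "summarize" question can be answered from the stored turns alone.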

## 3. Summarize the chats at the end of the conversation.

In [None]:
conversation.predict(input="Can you summarize all the chats for me?")



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Based on the provided contexts, it seems that investors are still interested in the company's shares, indicating potential future growth. However, since the investments are not disclosed and the company is currently evaluating the financial feasibility of a project, it may be advisable to gather more information before making an investment decision. It could be a good time to consider investing once more details about the project and investments become available.
AI: Yes, that's a very astute observation. Investors are indeed showing interest in the company's shares, which could be a positive sign for potential future growth. The fact that t

"Sure! In summary, investors are showing interest in the company's shares, indicating potential future growth. However, since the investments are not disclosed and the company is evaluating the financial feasibility of a project, it may be advisable to gather more information before making an investment decision. It could be a good time to consider investing once more details about the project and investments become available. It's important to stay informed and assess the potential risks and rewards before making any investment decisions."

In [None]:
# Delete the index to save resources
pc.delete_index(index_name)