# RAG that predicts a goal based on the current environment

This is an example project for building a goal prediction module.

In [1]:
import os
from dotenv import load_dotenv
from langchain_openai.chat_models import ChatOpenAI
# parser to extract string from answer
from langchain_core.output_parsers import StrOutputParser


In [None]:
# Keep real keys in a local .env file (excluded from version control)
# instead of hard-coding them in the notebook
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

# PINECONE_API_KEY is loaded from the same .env file
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")


## Invoking the model
- Use a parser to get the answer as a plain string

In [None]:


from langchain_core.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?")
parser = StrOutputParser()

# prompt -> model -> parser structure
chain = prompt | model | parser
chain.invoke({
    "context": "Mary's sister is Susana",
    "question": "Who is Mary's sister?"
})
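
To make the `prompt | model | parser` structure less opaque, here is a minimal pure-Python sketch of how pipe-style composition can work. The `Step` class and the `fake_*` stages are hypothetical stand-ins written for illustration; they are not LangChain's implementation, and no API call is made.

```python
class Step:
    """A tiny stand-in for a chain component (NOT LangChain's Runnable)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


TEMPLATE = "Context: {context}\n\nQuestion: {question}"

# Hypothetical stages mirroring prompt -> model -> parser
fake_prompt = Step(lambda inputs: TEMPLATE.format(**inputs))    # dict -> prompt text
fake_model = Step(lambda text: {"content": f"(echo) {text}"})   # pretend chat message
fake_parser = Step(lambda message: message["content"])          # keep only the string

fake_chain = fake_prompt | fake_model | fake_parser
result = fake_chain.invoke(
    {"context": "Mary's sister is Susana", "question": "Who is Mary's sister?"}
)
print(result)
```

The point of the parser stage is visible here: the model stage returns a message-like object, and the last stage reduces it to a plain string.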

## Transcribing Json Dataset into a database

- We will transcribe a JSON dataset into a database
- We will then use a search query to find relevant context
- The query will describe the current structure of the environment
- The query will also ask for the overarching goal of the current user
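
As a rough sketch of the last two points, the query could be assembled from an environment snapshot plus the goal question. The function name `build_goal_query` and the fields in `example_environment` are assumptions made for illustration; the real schema depends on the dataset.

```python
import json


def build_goal_query(environment: dict) -> str:
    """Combine a description of the environment with the goal question."""
    # Sorted keys make the same environment always produce the same query text
    env_text = json.dumps(environment, sort_keys=True)
    return (
        f"Current environment: {env_text}\n"
        "What is the overarching goal of the current user?"
    )


# Hypothetical environment snapshot, for illustration only
example_environment = {"room": "kitchen", "objects": ["kettle", "cup"]}
query = build_goal_query(example_environment)
print(query)
```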

### Embedding a query

- Embed the query using `OpenAIEmbeddings`

In [None]:
from langchain_openai.embeddings import OpenAIEmbeddings

queryToEmbed = "What is the goal of the user?"
embeddings = OpenAIEmbeddings()
embedded_query = embeddings.embed_query(queryToEmbed)

# the embedding length should match the dimension set in the browser interface
print(f"Embedding length: {len(embedded_query)}")

### Using VectorStore to make a dataset in memory

In [None]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore1 = DocArrayInMemorySearch.from_texts(
    [
        "ddd",
        "fff",
        "ggg"
    ],
    embedding=embeddings,
)
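
To show what an in-memory vector store does when it is searched, here is a small self-contained sketch. It assumes a toy bag-of-words "embedding" in place of `OpenAIEmbeddings`, and `ToyVectorStore` is a hypothetical class, not `DocArrayInMemorySearch`. The sample sentences are made up.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    def __init__(self, texts):
        # Embed every text once, at construction time
        self.entries = [(text, toy_embed(text)) for text in texts]

    def similarity_search(self, query: str, k: int = 1):
        query_vec = toy_embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "the user is boiling water in the kitchen",
    "the user is reading a book on the sofa",
    "the user is watering plants in the garden",
])
top_hits = store.similarity_search("who is in the kitchen", k=1)
print(top_hits)
```

A real embedding model captures meaning rather than exact word overlap, but the store-then-rank-by-similarity flow is the same idea.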

## TEXT / JSON splitting to Documents
### B-1 TEXT Loading and Splitting
1. Load the transcription
2. Load the transcription into a loader (txt -> memory)

In [None]:
# Example: read the file directly
with open("transcription.txt") as file:
    transcription = file.read()

transcription[:100]

To load the transcription, use a loader instead of a plain file reader.

In [None]:
# LOAD
from langchain_community.document_loaders import TextLoader
loader = TextLoader("transcription.txt")
text_documents = loader.load()
text_documents

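# Use the default recursive splitter and apply it to the loaded documents
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
split_documents = text_splitter.split_documents(text_documents)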
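
To illustrate what `chunk_size=100` and `chunk_overlap=20` mean, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and spaces); `chunk_with_overlap` is a hypothetical helper that only shows the size and overlap arithmetic.

```python
def chunk_with_overlap(text: str, chunk_size: int = 100, chunk_overlap: int = 20):
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 250-character sample text
sample = "".join(str(i % 10) for i in range(250))
chunks = chunk_with_overlap(sample, chunk_size=100, chunk_overlap=20)
print([len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk more often.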

For JSON we could use `CharacterTextSplitter`, but splitting on raw characters can destroy the context of a JSON file. A better option is a splitter that recognizes arrays and key-value pairs and splits the JSON while respecting the file's structure.

### B-2 JSON: use RecursiveJsonSplitter
- https://python.langchain.com/docs/how_to/recursive_json_splitter/
- Use `RecursiveJsonSplitter` and `.split_json`

Most of the ChatGPT examples from the second question onward use `CharacterTextSplitter` or variants of it, which are probably not the best fit for JSON.

In [None]:
from langchain_text_splitters import RecursiveJsonSplitter
import json

# Load the JSON file
with open("myjson.json", "r") as f:
    json_data = json.load(f)

# OPT1: GET CHUNKS (a list of smaller dicts)
json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = json_splitter.split_json(json_data=json_data)

# OPT2: GET DOCUMENTS
docs = json_splitter.create_documents(texts=[json_data])
for doc in docs[:3]:
    print(doc)

# OPT3: GET TEXTS
texts = json_splitter.split_text(json_data=json_data)
print(texts[0])
print(texts[1])
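
To illustrate the idea of splitting JSON while respecting its structure, here is a simplified sketch. It is not `RecursiveJsonSplitter`'s actual algorithm: `split_json_by_size` is a hypothetical helper that packs leaf values into chunks greedily and keeps the full key path for every value. Lists are treated as single values, and a single value larger than the limit still gets its own oversized chunk.

```python
import json


def split_json_by_size(data: dict, max_chunk_size: int = 300):
    """Split a nested dict into smaller dicts that keep each value's key path."""
    if not data:
        return []
    chunks = [{}]

    def set_path(target, path, value):
        # Recreate the nested keys leading to the value
        for key in path[:-1]:
            target = target.setdefault(key, {})
        target[path[-1]] = value

    def walk(node, path):
        if isinstance(node, dict) and node:
            for key, value in node.items():
                walk(value, path + [key])
            return
        # Leaf value: try to add it to the current chunk
        candidate = json.loads(json.dumps(chunks[-1]))  # cheap deep copy
        set_path(candidate, path, node)
        if chunks[-1] and len(json.dumps(candidate)) > max_chunk_size:
            chunks.append({})  # too big: start a new chunk
            set_path(chunks[-1], path, node)
        else:
            chunks[-1] = candidate

    walk(data, [])
    return chunks


# Hypothetical environment data, for illustration only
example = {
    "user": {"name": "A", "location": "kitchen"},
    "objects": {"kettle": {"state": "on"}, "cup": {"state": "empty"}},
}
json_chunks_demo = split_json_by_size(example, max_chunk_size=60)
for chunk in json_chunks_demo:
    print(json.dumps(chunk))
```

Each chunk is still valid JSON with its parent keys intact, which is the property a plain character splitter cannot guarantee.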

### B-3 JSON: Other recommended methods
- https://chatgpt.com/c/673fe115-4d18-8007-beb2-027fd355216e
- See the answers after the first question. They give some insight, but are not the most appropriate solutions.

In [None]:
# # Use ijson to stream a large JSON file and process it item by item. Not the most suitable example
# import ijson
# with open('large_file.json', 'rb') as f:  # binary mode is what ijson recommends
#     parser = ijson.items(f, 'items')  # Adjust 'items' based on your JSON structure
#     for item in parser:
#         print(item)

# # Split based on array elements: RecursiveJsonSplitter can probably do similar things
# import json

# with open('large_file.json', 'r') as f:
#     data = json.load(f)

# # Assume the large array is under 'data'
# chunk_size = 100
# chunks = [data['data'][i:i + chunk_size] for i in range(0, len(data['data']), chunk_size)]

# # Save each chunk to a separate file
# for idx, chunk in enumerate(chunks):
#     chunk_data = {'data': chunk}  # Re-wrap in a JSON structure if needed
#     with open(f'chunk_{idx}.json', 'w') as f:
#         json.dump(chunk_data, f)

## C: Configuring multiple dataset stores

- C-1: Embed separately and merge into a single dataset
- C-2: Keep separate embeddings and search them in parallel with one of LangChain's combining retrievers, such as `EnsembleRetriever` or `MergerRetriever`
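
As a rough sketch of option C-2, the snippet below queries two separate stores in parallel and merges the results. It uses only the standard library. The keyword-based retrievers, the store names, and the sample documents are all hypothetical; a real setup would use vector store retrievers instead.

```python
from concurrent.futures import ThreadPoolExecutor


def make_keyword_retriever(documents):
    """Hypothetical retriever: returns documents sharing a word with the query."""
    def retrieve(query: str):
        query_words = set(query.lower().split())
        return [d for d in documents if query_words & set(d.lower().split())]
    return retrieve


def search_in_parallel(retrievers, query: str):
    """Run every retriever on the query concurrently, then merge without duplicates."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        # pool.map keeps the results in the same order as `retrievers`
        result_lists = list(pool.map(lambda r: r(query), retrievers))
    merged, seen = [], set()
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


environment_store = make_keyword_retriever(["kettle is on", "cup is empty"])
history_store = make_keyword_retriever(["user filled the kettle", "user opened the door"])
merged_hits = search_in_parallel([environment_store, history_store], "kettle status")
print(merged_hits)
```

Compared with C-1, this keeps each dataset independently updatable, at the cost of having to decide how to rank results that come from different stores.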