# Medusa E-Comm RAG Docs

***
### **Last Updated: January 12, 2024**

#### todo notes:

I think we're heading toward a version 3.0 that uses a local PostgreSQL database and faster embedding calls to avoid rate limits. To get there, we'd have to move off the existing CSV files and rewrite our embedding code. With the latest updates to the core LangChain library and the major OpenAI 1.0.0 release, there is plenty of room for improvement. Our current generations still use OpenAI API version 0.28.

1.   Update the OpenAI version and the corresponding embedding functions.
2.   Resolve the LangChain warnings about deprecated modules and move to the new langchain-openai modules.
3.   Add streaming to the output text.
4.   Make corrections/updates to LangSmith (as needed).
5.   Convert the existing pandas DataFrame to a local PostgreSQL server; possibly log the chat sequence and enhance conversational history.
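
For item 1, here is a minimal sketch of what the embedding helper could look like after moving to `openai>=1.0`. It assumes the v1 client interface (`client.embeddings.create(...)` returning an object whose `.data` items carry `.embedding`) and takes the client as a parameter so it can be tried without network access. The names `embed_texts` and `batch_size` are illustrative and do not exist in the current notebook.

```python
from typing import List, Sequence


def embed_texts(client, texts: Sequence[str],
                model: str = "text-embedding-ada-002",
                batch_size: int = 64) -> List[List[float]]:
    """Embed texts in batches through an OpenAI v1-style client.

    Assumption: `client.embeddings.create(model=..., input=[...])` returns
    an object whose `.data` items expose `.embedding`, in input order.
    """
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        # Flatten newlines, mirroring the notebook's existing embedding_function
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```

With the real library this would presumably be called as `embed_texts(OpenAI(), df['cleaned_content_chunk'].tolist())` after `from openai import OpenAI`. Batching sends one request per `batch_size` chunks instead of one per row, which should ease the rate-limit pressure mentioned above.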

#### questions/ answers history:
i. As it stands, the memory is stored in the application's runtime memory: it exists only while the application is running and is lost once the application terminates. There is no standalone chat history, so it's best practice to take notes as development continues.
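
One way to keep chat history across restarts is to log every question/answer pair to a small database. The sketch below uses SQLite from the Python standard library as a stand-in for the local PostgreSQL server mentioned in the todo list. The table name, schema, and the helpers `open_chat_log`, `log_turn`, and `load_history` are all hypothetical.

```python
import sqlite3
from typing import List, Tuple


def open_chat_log(path: str = "chat_history.db") -> sqlite3.Connection:
    """Open (or create) the chat log; file name and schema are assumptions."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chat_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session TEXT NOT NULL, "
        "question TEXT NOT NULL, "
        "answer TEXT NOT NULL)"
    )
    return conn


def log_turn(conn: sqlite3.Connection, session: str,
             question: str, answer: str) -> None:
    """Append one question/answer pair to the log."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO chat_log (session, question, answer) VALUES (?, ?, ?)",
            (session, question, answer),
        )


def load_history(conn: sqlite3.Connection, session: str) -> List[Tuple[str, str]]:
    """Return the (question, answer) pairs of a session in insertion order."""
    rows = conn.execute(
        "SELECT question, answer FROM chat_log WHERE session = ? ORDER BY id",
        (session,),
    )
    return rows.fetchall()
```

In the notebook, `log_turn(conn, "dev", question, result['answer'])` could follow each `qa({"question": question})` call, so that earlier turns can be reloaded after the kernel restarts.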

In [1]:
# MASTER-CODEBLOCK
##################################
# Block 000 Initialize Dependencies
#### REQUIREMENTS.TXT (reference)

! pip install "deeplake[enterprise]"
! pip install langchain
! pip install nltk
! pip install openai==0.28
! pip install pandas
# pdfminer.six is the maintained fork; installing the legacy pdfminer package alongside it can shadow the same module
! pip install pdfminer.six
! pip install plotly
! pip install -U scikit-learn
! pip install tiktoken
! pip install torch
! pip install transformers
! pip install tqdm

Collecting deeplake[enterprise]
  Using cached deeplake-3.8.14-py3-none-any.whl
Collecting numpy (from deeplake[enterprise])
  Obtaining dependency information for numpy from https://files.pythonhosted.org/packages/94/9c/f1e88764737c126637d0434df712b1baa371a404a3e3751ee997e74e164b/numpy-1.26.3-cp312-cp312-macosx_11_0_arm64.whl.metadata
  Using cached numpy-1.26.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
Collecting pillow (from deeplake[enterprise])
  Obtaining dependency information for pillow from https://files.pythonhosted.org/packages/9d/a0/28756da34d6b58c3c5f6c1d5589e4e8f4e73472b55875524ae9d6e7e98fe/pillow-10.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata
  Using cached pillow-10.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (9.7 kB)
Collecting boto3 (from deeplake[enterprise])
  Obtaining dependency information for boto3 from https://files.pythonhosted.org/packages/e3/f7/93a4ba1cd2cc4ee95f871b0890e4ed60e52365110a074e7265279750a736/boto3-1.34.18-py3-none-any.whl.meta

In [2]:
# MASTER-CODEBLOCK
##################################
# Block 001 Initialize Import Statements
#### IMPORT STATEMENTS (reference)

import os
import json
# import matplotlib
# import matplotlib.pyplot as plt
import nltk
import numpy as np
import openai
import pandas as pd
import re
# import torch
from collections import Counter
from deeplake.core.vectorstore import VectorStore
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.memory import ConversationTokenBufferMemory
from langchain.prompts import PromptTemplate
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import DeepLake
from nltk.corpus import stopwords
from pdfminer.high_level import extract_text, extract_pages
from sklearn.cluster import KMeans
from tqdm import tqdm
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2TokenizerFast

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [3]:
# MASTER-CODEBLOCK
##################################
# Block 002 Initialize Pandas Database
#### CREATE Pandas-DB INDEX

# Download necessary NLTK data
nltk.download('stopwords')

# Stop words from NLTK
stop_words = set(stopwords.words('english'))

# Set the max cell size for text (32,767 characters = true limit)
MAX_CELL_SIZE = 11250

# Directory containing your PDFs
pdf_directory = '/Users/matthewsimon/Documents/GitHub/acdc.vendure_v2/data/source_docs'

# List to store data
data = []

# Wrap the loop with tqdm for a progress bar
for pdf_file in tqdm(os.listdir(pdf_directory)):
    if pdf_file.endswith('.pdf'):
        file_path = os.path.join(pdf_directory, pdf_file)
        try:
            print(f"Processing {pdf_file}...")
            text = extract_text(file_path)

            if not text:
                print(f"Extracted text is empty for {pdf_file}")
                continue

            # Basic heuristic: Assuming title is the first line and summary is the second line
            lines = text.split('\n')
            title = lines[0] if len(lines) > 0 else ''
            summary = lines[1] if len(lines) > 1 else ''

            # Metadata extraction
            file_size = os.path.getsize(file_path)
            number_of_pages = len(list(extract_pages(file_path)))

            # Filter Stopwords
            text_words = text.split()
            filtered_words = [word for word in text_words if word.lower() not in stop_words]
            filtered_text = ' '.join(filtered_words)

            # Chunking the content
            chunks = [filtered_text[i:i+MAX_CELL_SIZE] for i in range(0, len(filtered_text), MAX_CELL_SIZE)]
            for chunk in chunks:
                data.append({
                    'filename': pdf_file,
                    'title_or_heading': title,
                    'content_summary': summary,
                    'content_chunk': chunk,
                    'file_size': file_size,
                    'number_of_pages': number_of_pages
                })

        except Exception as e:
            print(f"Error processing {pdf_file}: {e}")
            continue

# Convert the list to a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV for further analysis with escapechar
df.to_csv('/Users/matthewsimon/Documents/GitHub/acdc.vendure_v2/data/source_csv/source_docs.csv', index=False, escapechar='\\')

## print("Listing directory contents:")
## print(os.listdir(pdf_directory))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/matthewsimon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  0%|          | 0/1 [00:00<?, ?it/s]

Processing Vendure-Connect-API.pdf...


100%|██████████| 1/1 [00:00<00:00,  2.42it/s]
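
Block 002 slices the filtered text every `MAX_CELL_SIZE` characters, which can cut a word in half at a chunk boundary. As a sketch of one alternative, the function below packs whole words into chunks of at most `max_chars` characters. `chunk_words` is a hypothetical helper, not what the notebook currently runs, and a single word longer than `max_chars` is kept as its own oversized chunk.

```python
from typing import List


def chunk_words(text: str, max_chars: int = 11250) -> List[str]:
    """Split text into chunks of whole words, each at most max_chars long."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk already has words
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            extra = len(word)
        current.append(word)
        current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

It could replace the list comprehension that builds `chunks`, for example `chunks = chunk_words(filtered_text, MAX_CELL_SIZE)`.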


In [5]:
# MASTER-CODEBLOCK
##################################
# Block 003 Initialize Embeddings & Update.df 
#### UPDATE PANDAS.DB INDEX WITH NEW EMBEDDINGS COLUMN 

nltk.download('stopwords')
from nltk.corpus import stopwords

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = text.split()
    cleaned_text = " ".join([word for word in words if word.lower() not in stop_words])
    return cleaned_text

# Read the OpenAI key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ['OPENAI_API_KEY']
input_datapath = '/Users/matthewsimon/Documents/GitHub/acdc.vendure_v2/data/source_csv/source_docs.csv'
df = pd.read_csv(input_datapath)

# Check if DataFrame is empty
if df.empty:
    print("The DataFrame is empty. Please check your data source.")
else:
    # Your existing setup
    # .copy() avoids pandas' SettingWithCopyWarning when new columns are added below
    df = df[['filename', 'title_or_heading', 'content_summary', 'content_chunk', 'file_size']].copy()

    # Remove stop words from 'content_chunk'
    df['cleaned_content_chunk'] = df['content_chunk'].apply(remove_stopwords)

    # Initialize the tokenizer
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    # Function to return token IDs and decoded tokens
    def get_tokens_and_decoded(text):
        token_ids = tokenizer.encode(text, truncation=True, max_length=4095)
        decoded_tokens = [tokenizer.decode([token_id]) for token_id in token_ids]
        return token_ids, decoded_tokens

    # Tokenize each chunk once, then derive the token count, token IDs, and decoded tokens from that result
    tokenized = df['cleaned_content_chunk'].apply(get_tokens_and_decoded)
    df['n_tokens'] = tokenized.apply(lambda pair: len(pair[0]))
    df['tokens'] = tokenized.apply(lambda pair: pair[0])
    df['decoded_tokens'] = tokenized.apply(lambda pair: pair[1])

    # Filter rows based on token count
    df = df[df.n_tokens < 5000]

# Define the function to get embeddings using OpenAI's API
def get_embedding(text, engine):
    response = openai.Embedding.create(input=[text], engine=engine)
    return response['data'][0]['embedding']

# Generate embeddings
df['ada_similarity'] = df['cleaned_content_chunk'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))

# Save the DataFrame to a new CSV file
df.to_csv('/Users/matthewsimon/Documents/GitHub/acdc.vendure_v2/data/source_ada/source_ada.csv', index=False)
# Read the new CSV file to verify
df_new = pd.read_csv('/Users/matthewsimon/Documents/GitHub/acdc.vendure_v2/data/source_ada/source_ada.csv')

print(df_new.columns)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/matthewsimon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Index(['filename', 'title_or_heading', 'content_summary', 'content_chunk',
       'file_size', 'cleaned_content_chunk', 'n_tokens', 'tokens',
       'decoded_tokens', 'ada_similarity'],
      dtype='object')


In [6]:
# MASTER-CODEBLOCK
##################################
# Block 004 Initialize Pandas & Activeloop DeepLake Database
#### LOAD PANDAS.DB TO DEEPLAKE

# The Activeloop token is expected in the environment; do not hard-code it in the notebook
if 'ACTIVELOOP_TOKEN' not in os.environ:
    raise RuntimeError("Set ACTIVELOOP_TOKEN in the environment before running this block")
# Load DataFrame from CSV
df = pd.read_csv('/Users/matthewsimon/Documents/GitHub/acdc.vendure_v2/data/source_ada/source_ada.csv')

# Prepare data
chunked_text = df['content_chunk'].tolist()
source_texts = df['filename'].tolist()
precomputed_embeddings = df['ada_similarity'].apply(json.loads).tolist()  # Embeddings are stored in the CSV as list strings; parse them without eval

# Initialize Vector Store with the Hub URL
vector_store_path = "hub://solomon/vendure-io"
vector_store = VectorStore(
    path=vector_store_path,
)

# Add data to Vector Store
vector_store.add(
    text=chunked_text,
    embedding=precomputed_embeddings,
    metadata=[{"source": source_text} for source_text in source_texts]
)

Your Deep Lake dataset has been successfully created!


 

Uploading data to deeplake dataset.


100%|██████████| 1/1 [00:01<00:00,  1.12s/it]

Dataset(path='hub://solomon/vendure-io', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
   text       text      (1, 1)      str     None   
 metadata     json      (1, 1)      str     None   
 embedding  embedding  (1, 1536)  float32   None   
    id        text      (1, 1)      str     None   
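
Block 005 queries this dataset with `distance_metric='cos'` and `k=4`. As a rough illustration of what such a ranking computes (not Deep Lake's actual implementation), here is a dependency-free sketch; `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]],
          k: int = 4) -> List[Tuple[int, float]]:
    """Return (row index, similarity) for the k stored vectors closest to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Applied to the notebook's data, `top_k(query_embedding, precomputed_embeddings)` would give the row indices of the four chunks most similar to a query embedding.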


 

In [8]:
# MASTER-CODEBLOCK
##################################
# Block 005 Initialize Conversational Chat
#### EMBEDDING RETRIEVAL FOR DOCS-QA

# Initialize OpenAI
import os
# Credentials are expected in the environment; do not hard-code them in the notebook
for required_var in ('OPENAI_API_KEY', 'ACTIVELOOP_TOKEN'):
    if required_var not in os.environ:
        raise RuntimeError(f"Set {required_var} in the environment before running this block")


# Your embedding function
def embedding_function(texts, model="text-embedding-ada-002"):
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]
    return [data['embedding'] for data in openai.Embedding.create(input=texts, model=model)['data']]

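# Wrap your function in a class exposing the embed_documents / embed_query methods LangChain calls
class MyEmbeddingFunction:
    def __init__(self, func):
        self.func = func

    def embed_documents(self, texts):
        return self.func(texts)

    def embed_query(self, query):
        # func returns one vector per input text; a single query needs only the first
        return self.func(query)[0]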

# Initialize DeepLake database with the embedding_function
embedding_function_obj = MyEmbeddingFunction(embedding_function)
db = DeepLake(dataset_path="hub://solomon/vendure-io", embedding=embedding_function_obj, read_only=False)

# Initialize Retriever with parameters
retriever = db.as_retriever()
retriever.search_kwargs.update({
    'distance_metric': 'cos',
    'k': 4
})

# Define the PromptTemplate
template = """
You are Solomon, a new type of personal assistant. Your goal is to use your vast knowledge and advanced artificial intelligence capabilities to help answer Users' questions by retrieving and citing knowledge that may not otherwise have been identifiable with human capabilities alone. You must answer questions in a human-like manner.

You can assume the User asking questions is an expert in the subject field of their question. Be sure to break down complex tasks into a sequence of manageable steps using your critical analysis.

Use the following context to assist in answering any questions that come up. If you don't know the answer, or if the answer is not provided in the context in some way, just say that there's no relevant information within the context to answer the User's question. 
{context}
Question: {question}
Helpful Answer:
"""

# Create a PromptTemplate object
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Initialize LLM for QA
model = ChatOpenAI(model='gpt-4-1106-preview')

# Configure LangSmith tracing (LANGCHAIN_API_KEY is expected in the environment, not in the notebook)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_PROJECT"] = "solomon.v2.2"

# Initialize LangChain memory with a token buffer

memory = ConversationTokenBufferMemory(
    llm=model,
    max_token_limit=450,
    memory_key="chat_history",
    return_messages=True
)

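# Initialize Conversational Retrieval Chain with Memory and the custom QA prompt
# (without combine_docs_chain_kwargs, QA_CHAIN_PROMPT is defined but never used)
qa = ConversationalRetrievalChain.from_llm(
    llm=model,
    retriever=retriever,
    memory=memory,
    combine_docs_chain_kwargs={"prompt": QA_CHAIN_PROMPT}
)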

# Define your search query
search_query = 'Hi! Weve sucessfully initiated our Vendure.io backend/ admin pages, and weve now initated our Qwick storefront. We need to sucessfully connect the API to our frontend (storefront) and our backend (admin pages). Ive attahed the API-Connection docs for your reference. Are you able to guide me through this process?'

# Count the number of tokens in the search query and prompt
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
search_query_tokens = tokenizer.encode(search_query, truncation=True)
prompt_tokens = tokenizer.encode(template, truncation=True)
num_search_query_tokens = len(search_query_tokens)
num_prompt_tokens = len(prompt_tokens)

# Run the QA model with top-k documents
result = qa({"question": search_query})
response = result['answer']
print("\nQA Response:")
print(response)

# Count the number of tokens in the generated response
response_tokens = tokenizer.encode(response, truncation=True)
num_response_tokens = len(response_tokens)

# Print token counts
print(f"\nNumber of tokens in the search query: {num_search_query_tokens}")
print(f"Number of tokens in the prompt: {num_prompt_tokens}")
print(f"Number of tokens in the generated response: {num_response_tokens}")

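# Extract and print unique sources (top 3, in retrieval order)
print("\nUnique Sources:")
docs = retriever.get_relevant_documents(search_query)
# dict.fromkeys de-duplicates while keeping the retriever's ranking; a set has no defined order
unique_sources = list(dict.fromkeys(doc.metadata.get('source', 'N/A') for doc in docs))
unique_sources_top3 = unique_sources[:3]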
print(unique_sources_top3)

Deep Lake Dataset in hub://solomon/vendure-io already exists, loading from the storage





QA Response:
Absolutely! To connect your Qwick storefront to the Vendure backend using the Vendure Shop API, you'll need to follow these general steps:

1. **Set Up GraphQL Client:**
   You will need a GraphQL client to interact with the Vendure Shop API. You can use TanStack Query with graphql-request as a client. To set it up, you can refer to the `src/client.ts` example provided in the context to configure the GraphQLClient with request and response middleware for handling authentication tokens.

2. **Authentication:**
   Vendure supports two methods for session management: cookie-based and bearer token. If you opt for bearer token, ensure that you're storing the token in localStorage and adding it to the headers of outgoing GraphQL requests.

3. **Configure Vendure Client Settings:**
   - Set up the `API_URL` to point to your Vendure server's Shop API endpoint.
   - Configure the client to use credentials for cookie-based sessions if you're using cookies.
   - Add request and resp

In [None]:
# Code for asking questions
question = "If I'm using the TanStack Query library, do I need to install anything additional for GraphQL? Is GraphQL it's own library that I need to download in order for Vendure to work, I don't understand why I need it or how to set it up in order to use Tanstack with it- can you please explain this process to me?"
result = qa({"question": question})
print(result['answer'])

In [None]:
# Code for asking questions
question = "Ok, so I need to install Tanstack and GraphQL - I'll use the graphql-request client. Will these installations create new directories within my project?"
result = qa({"question": question})
print(result['answer'])

In [None]:
# Code for asking questions
question = "So, in the Docs it shows examples for scripts to be modified in the Tanstack example section. Where are these scripts/ files which need to be modified?"
result = qa({"question": question})
print(result['answer'])