# Retrieval-Augmented Generation for Product Selection using Groq API and LangChain

In this notebook we use the [Groq API](https://console.groq.com), [LangChain](https://www.langchain.com/) and [Pinecone](https://www.pinecone.io/) to perform retrieval-augmented generation (RAG). We create vector embeddings from each book's metadata and reviews in an Amazon dataset, store them in a vector database, retrieve the books most relevant to the user's prompt, and include them as context for the LLM.

### Setup

In [1]:
import pandas as pd
import numpy as np
from groq import Groq
import os
import pinecone

from langchain_community.vectorstores import Chroma
from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_pinecone import PineconeVectorStore
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

from IPython.display import display, HTML




Both `GROQ_API_KEY` and `PINECONE_API_KEY` must be set before the Groq and Pinecone clients below can be created.

In [2]:
os.environ["GROQ_API_KEY"] = "<your-groq-api-key>"  # set this to your own Groq API key; never commit a real key
os.environ["PINECONE_API_KEY"] = "<your-pinecone-api-key>"  # set this to your own Pinecone API key

In [3]:
groq_api_key = os.getenv('GROQ_API_KEY')
pinecone_api_key = os.getenv('PINECONE_API_KEY')

client = Groq(api_key = groq_api_key)
model = "mixtral-8x7b-32768"
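
Keys written directly into a notebook are easy to leak, and a missing key only surfaces later as an authentication error. As a minimal sketch, a small helper can read a key from the environment and fail early with a clear message. The name `require_env` is hypothetical and not part of the Groq or Pinecone libraries.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage: fail early instead of passing an empty key to a client.
# groq_api_key = require_env("GROQ_API_KEY")
# pinecone_api_key = require_env("PINECONE_API_KEY")
```

With this in place, the keys can be exported in the shell or a secrets manager before the notebook starts, so they never appear in a cell.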

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [4]:
df = pd.read_csv(r"C:\Users\bindu\Desktop\transformed_df.csv")
df.head()

| | Title | description | authors | publisher | publishedDate | categories | ratingsCount | Id | review_text | review_summary | review_time | review_helpfulness | review_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Its Only Art If Its Well Hung! | | ['Julie Strain'] | | 1996 | ['Comics & Graphic Novels'] | 0.0 | 1882931173 | This is only for Julie Strain fans. It's a col... | Nice collection of Julie Strain images | 1999-10-23 00:00:00+00:00 | 1.0 | 4.0 |
| 1 | Dr. Seuss: American Icon | Philip Nel takes a fascinating look into the k... | ['Philip Nel'] | A&C Black | 2005-01-01 | ['Biography & Autobiography'] | 0.0 | 826414346 | I don't care much for Dr. Seuss but after read... | Really Enjoyed It__Essential for every persona... | 2009-01-06 00:00:00+00:00 | 0.695477 | 4.555556 |
| 2 | Wonderful Worship in Smaller Churches | This resource includes twelve principles in un... | ['David R. Ray'] | | 2000 | ['Religion'] | 0.0 | 829814000 | I just finished the book, &quot;Wonderful Wors... | Outstanding Resource for Small Church Pastors_... | 2010-12-08 00:00:00+00:00 | 0.95 | 5.0 |
| 3 | Whispers of the Wicked Saints | Julia Thomas finds her life spinning out of co... | ['Veronica Haddon'] | iUniverse | 2005-02 | ['Fiction'] | 0.0 | 595344550 | I bought this book because I read some glowing... | not good__Here is my opinion__Buyer beware__Fa... | 2006-07-01 00:00:00+00:00 | 0.451261 | 3.71875 |
| 4 | Nation Dance: Religion, Identity and Cultural ... | | ['Edward Long'] | | 2003-03-01 | [] | 0.0 | 253338352 | from publisher:Addresses the interplay of dive... | interplay of traditions across Caribbean | 2008-02-04 00:00:00+00:00 | 1.0 | 5.0 |


In [5]:
df.columns

Index(['Title', 'description', 'authors', 'publisher', 'publishedDate',
       'categories', 'ratingsCount', 'Id', 'review_text', 'review_summary',
       'review_time', 'review_helpfulness', 'review_score'],
      dtype='object')

A Hugging Face token is needed to download the Mixtral tokenizer from Mistral AI.

In [6]:
os.environ["HUGGINGFACE_TOKEN"] = "<your-hugging-face-token>"  # set this to your own Hugging Face token; never commit a real token

In [7]:
model_id = "mistralai/Mixtral-8x7B-v0.1"

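# Read the token from the environment instead of hard-coding it;
# `token` replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.getenv("HUGGINGFACE_TOKEN"))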

# create the length function
def token_len(text):
    tokens = tokenizer.encode(
        text
    )
    return len(tokens)



In [8]:
text_splitter = TokenTextSplitter(
    chunk_size=500, # 500 tokens is the max
    chunk_overlap=20 
)
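
To make `chunk_size` and `chunk_overlap` concrete, the sketch below slides a window over a list of tokens. It illustrates the general idea only and is not LangChain's implementation; `sliding_chunks` is a hypothetical helper, and the integers stand in for real token ids.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of at most `chunk_size` items,
    where consecutive windows share `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = list(range(10))
print(sliding_chunks(tokens, chunk_size=4, chunk_overlap=1))
# prints [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the settings used in this notebook (`chunk_size=500`, `chunk_overlap=20`), each window would advance by 480 tokens under this scheme, so the last 20 tokens of one chunk reappear at the start of the next.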

In [13]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")






In [30]:
def book_selection(client, model, user_question, relevant_excerpts):
    chat_completion = client.chat.completions.create(
        messages = [
            {
                "role": "system",
                "content": (
                    "You are an intelligent agent that recommends books. Users will describe "
                    "the kind of book they want to read, and you suggest books that match. "
                    "Use only the information provided in the database. If you cannot find "
                    "an answer there, say you don't know. Do not make assumptions or invent answers."
                ),
            },
            {
                "role": "user",
                "content": "User Question: " + user_question + "\n\nSuggested books:\n\n" + relevant_excerpts,
            }
        ],
        model = model
    )

    response = chat_completion.choices[0].message.content
    return response



In [10]:
# Combine relevant fields into a single document string
df['document'] = df.apply(lambda row: (
    f"Title: {row['Title']}, Authors: {row['authors']}, PublishedIn: {row['publishedDate']}, "
    f"Rating: {row['review_score']}, Publisher: {row['publisher']}, Categories: {row['categories']}, "
    f"Description: {row['description']}, Review: {row['review_text']}"), axis=1)
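
Several rows in the preview above have an empty `description` or `publisher`, which pandas typically loads as `NaN` and which would then be embedded as the literal text `nan`. One possible way to avoid that, sketched here with a plain dict standing in for a DataFrame row, is to skip missing fields when building the string. `build_document` and the sample row are hypothetical.

```python
import math

# (label in the document string, column name in the DataFrame)
FIELDS = [
    ("Title", "Title"),
    ("Authors", "authors"),
    ("PublishedIn", "publishedDate"),
    ("Rating", "review_score"),
    ("Publisher", "publisher"),
    ("Categories", "categories"),
    ("Description", "description"),
    ("Review", "review_text"),
]


def build_document(row: dict) -> str:
    """Join labelled fields into one string, skipping None and NaN values."""
    parts = []
    for label, key in FIELDS:
        value = row.get(key)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # leave out fields with no data
        parts.append(f"{label}: {value}")
    return ", ".join(parts)


# Made-up sample row for illustration only
sample = {
    "Title": "Example Book",
    "authors": "['A. Author']",
    "publisher": float("nan"),
    "review_score": 4.5,
}
print(build_document(sample))
# prints Title: Example Book, Authors: ['A. Author'], Rating: 4.5
```

Applied to the DataFrame, this could look like `df.apply(lambda row: build_document(row.to_dict()), axis=1)`.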



In [None]:
documents = []
for _, row in df.iterrows():
    # Split each book's document string into token-bounded chunks
    for chunk in text_splitter.split_text(row["document"]):
        documents.append(Document(page_content=chunk, metadata={"source": "local"}))

print(len(documents))

Create a Pinecone index

In [10]:
pinecone_index_name = "book-recommendation" # set this to your own index name
#docsearch = PineconeVectorStore.from_documents(documents, embedding_function, index_name=pinecone_index_name)

### Use Chroma for open source option
#docsearch = Chroma.from_documents(documents, embedding_function)


In [11]:
from pinecone.grpc import PineconeGRPC as Pinecone

pc = Pinecone(api_key=pinecone_api_key)
index = pc.Index(pinecone_index_name)

In [41]:
query_text = "Suggest a book by Thomas Hardy with a rating of 4.5 or above in the drama genre." #published after 2015
query_embedding = embedding_function.embed_query(query_text)

In [42]:
matched_items = index.query(
    #namespace="example-namespace",
    vector=query_embedding,
    top_k=5,
    include_values=True,
    include_metadata=True
)
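
Conceptually, the query above scores every stored vector against the query embedding and returns the `top_k` best matches. The sketch below shows that idea by brute force, assuming the index uses the cosine metric (the notebook does not say which metric the index was created with). A hosted index would normally use approximate search instead, and the three-dimensional toy vectors stand in for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the ids of the k vectors most similar to `query`, best first."""
    scored = [(cosine(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vec_id for _, vec_id in scored[:k]]


# Hypothetical toy index: ids mapped to small vectors
toy_vectors = {
    "book-a": [1.0, 0.0, 0.0],
    "book-b": [0.9, 0.1, 0.0],
    "book-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
# prints ['book-a', 'book-b']
```

Note that similarity search ranks by meaning only. A constraint in the query text such as "rating of 4.5 or above" is not enforced by the vector lookup itself.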

In [45]:
relevant_excerpts = '\n\n------------------------------------------------------\n\n'.join([match['metadata']['text'] for match in matched_items['matches']])
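
The join above raises a `KeyError` if any match lacks a `text` entry in its metadata. As a defensive sketch, the helper below skips such matches. It uses plain dicts in place of Pinecone match objects, and both `join_excerpts` and the shortened separator are hypothetical.

```python
SEPARATOR = "\n\n---\n\n"


def join_excerpts(matches, separator=SEPARATOR):
    """Concatenate the 'text' metadata of each match, skipping matches without it."""
    texts = []
    for match in matches:
        metadata = match.get("metadata") or {}
        text = metadata.get("text")
        if text:
            texts.append(text)
    return separator.join(texts)


# Made-up matches for illustration; the second and third have no usable text
demo_matches = [
    {"metadata": {"text": "Title: Book A"}},
    {"metadata": {}},
    {"metadata": None},
    {"metadata": {"text": "Title: Book B"}},
]
print(join_excerpts(demo_matches))
```

An empty result is a useful signal here: if no excerpt was retrieved, the LLM call can be skipped and the user told that nothing matched.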

In [44]:
response = book_selection(client, model, query_text, relevant_excerpts)
display(HTML(response.replace("\n", "<br>")))