<a href="https://colab.research.google.com/github/bindusri0702/PDF-Classification/blob/main/Product_Selection_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Generation for Product Selection using Groq API and LangChain

In this notebook we use the [Groq API](https://console.groq.com), [LangChain](https://www.langchain.com/), and [Pinecone](https://www.pinecone.io/) to perform retrieval-augmented generation (RAG). We create vector embeddings for each product specification PDF, store them in a vector database, retrieve the product specifications most relevant to the user's prompt, and include them as context for the LLM.
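As a rough orientation, the embed, store, retrieve, and prompt flow can be sketched without any external service. The snippet below is a toy illustration only: `toy_embed`, the vocabulary, and the sample specs are hypothetical stand-ins for the sentence-transformer embeddings, the Pinecone index, and the PDF text used later in the notebook.

```python
import numpy as np

# Hypothetical mini-corpus standing in for the product specification PDFs.
specs = [
    "mini blade fuse automotive voltage rating amperage",
    "pendant lighting fabric laminated shade aluminum",
    "coaxial cable copper conductor shielded jacket",
]

# Toy vocabulary; a real pipeline uses a learned embedding model instead.
vocab = sorted({word for text in specs for word in text.split()})

def toy_embed(text):
    """Bag-of-words vector, L2-normalized (illustrative only)."""
    words = text.split()
    vec = np.array([words.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# "Vector database": one embedding row per document.
index = np.stack([toy_embed(text) for text in specs])

def retrieve(query, k=1):
    """Return the k specs with the highest cosine similarity to the query."""
    scores = index @ toy_embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [specs[i] for i in top]

# Retrieved text becomes the context of the prompt sent to the LLM.
question = "fuse with high amperage rating"
context = "\n".join(retrieve(question, k=1))
prompt = f"Based on the following product specifications:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The rest of the notebook follows the same shape, with a real embedding model, a hosted vector store, and a chat model in place of the toy pieces.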

In [None]:
!pip install groq

Collecting groq
  Downloading groq-0.10.0-py3-none-any.whl.metadata (13 kB)
Collecting httpx<1,>=0.23.0 (from groq)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->groq)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->groq)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading groq-0.10.0-py3-none-any.whl (106 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Downloading h11-0.14.0-py3-none-any.whl (58 kB

In [None]:
!pip install pinecone-client

Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone-client)
  Downloading pinecone_plugin_inference-1.0.3-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone_client-5.0.1-py3-none-any.whl (244 kB)
Downloading pinecone_plugin_inference-1.0.3-py3-none-any.whl (117 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone-client
Successfully installed pinecone-client-5

In [None]:
!pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.2.14-py3-none-any.whl.metadata (2.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langchain<0.3.0,>=0.2.15 (from langchain-community)
  Downloading langchain-0.2.15-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core==0.2.36 (from langchain-community)
  Downloading langchain_core-0.2.36-py3-none-any.whl.metadata (6.2 kB)
Collecting langsmith<0.2.0,>=0.1.0 (from langchain-community)
  Downloading langsmith-0.1.106-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain-community)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core==0.2.36->langchain-community)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchai

In [None]:
!pip install langchain-pinecone

Collecting langchain-pinecone
  Downloading langchain_pinecone-0.1.3-py3-none-any.whl.metadata (1.7 kB)
Downloading langchain_pinecone-0.1.3-py3-none-any.whl (10 kB)
Installing collected packages: langchain-pinecone
Successfully installed langchain-pinecone-0.1.3


In [None]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [None]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.0.1


### Setup

In [None]:
import pandas as pd
import numpy as np
from groq import Groq
import os
import pinecone

from langchain_community.vectorstores import Chroma
from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_pinecone import PineconeVectorStore
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

from IPython.display import display, HTML

Both `GROQ_API_KEY` and `PINECONE_API_KEY` must be set before the rest of the notebook can run.

In [None]:
# Never commit real keys to a shared notebook; replace the placeholders with your own values
os.environ["GROQ_API_KEY"] = "YOUR_GROQ_API_KEY" # set this to your own GROQ API key
os.environ['PINECONE_API_KEY'] = "YOUR_PINECONE_API_KEY" # set this to your own PINECONE API key
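Rather than pasting keys into a cell, one possible alternative is to read each key from the environment and fall back to a hidden interactive prompt. The helper name `load_secret` below is hypothetical and not part of any library used in this notebook.

```python
import os
from getpass import getpass

def load_secret(name, prompt_fn=getpass):
    """Return the secret stored in environment variable `name`.

    If the variable is unset or empty, ask for it via `prompt_fn`
    (getpass hides the typed input) and cache it in os.environ so
    later cells can read it.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        value = prompt_fn(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} is required but was not provided")
        os.environ[name] = value
    return value

# Possible usage in the notebook (prompts only when the variable is missing):
# groq_api_key = load_secret("GROQ_API_KEY")
# pinecone_api_key = load_secret("PINECONE_API_KEY")
```

With this approach the key never appears in the saved notebook or its outputs.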

In [None]:
groq_api_key = os.getenv('GROQ_API_KEY')
pinecone_api_key = os.getenv('PINECONE_API_KEY')

client = Groq(api_key = groq_api_key)
model = "mixtral-8x7b-32768"

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/preprocessed_test_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,datasheet_link,target_col,pdf_text,cleaned_text
0,0,https://lumenart.com/images/alume/awl-01_specs...,lighting,AWL.01\nSPECIFICATIONS\nMaterial\nMachined alu...,specifications material machined aluminum with...
1,1,https://lumenart.com/images/fabric/rdc/rdc_spe...,lighting,RDC Series\nSPECIFICATIONS\nConstruction\nFabr...,series specifications construction fabric lami...
2,2,https://lumenart.com/images/fabric/cyp/cyp_spe...,lighting,CYP Series\nSPECIFICATIONS\nConstruction\nFabr...,series specifications construction fabric lami...
3,3,https://lumenart.com/images/designer/wlp_specs...,lighting,WLP\nSPECIFICATIONS\nConstruction\nExtruded al...,specifications construction extruded alumiunum...
4,4,https://lumenart.com/images/designer/wcp/wcp-s...,lighting,WCP-S\nSPECIFICATIONS\nConstruction\nReal oak ...,wcps specifications construction real walnut v...


In [None]:
set(df['target_col'])

{'cable', 'fuses', 'lighting', 'others'}

In [None]:
fs = df[df['target_col']=='fuses'].cleaned_text.tolist()

In [None]:
fs[:10]

['littelfuse specifications subject change without notice revised datasheet rated mini blade fuses description mini automotive blade fuses boast miniature design that allows automakers pack more circuit protection into less space despite their light weight mini fuses perform reliably adverse environments extreme temperatures applications features benefits cars trucks suvs color coding shows amperage rating each fuse seethrough housing makes easy check whether fuse blown checkpoints make possible measure resistance without removing fuse offroad vehicles buses watercraft highcontrast amperage stamp housing aids identification simple install remove voltage rating interrupting rating recommended environmental temperature terminals material silver plated plated zinc alloy housing material flammability rating weight fuse complies with plated special purpose fuses plated recognized specifications platings temperature limit silver plating allows terminal interface ordering information part num

In [None]:
pdf_list = df['cleaned_text'].tolist()

In [None]:
pdf_text = pdf_list[0]

A Hugging Face token is used to download the Mixtral tokenizer from the Hugging Face Hub.

In [None]:
os.environ["HUGGINGFACE_TOKEN"] = "YOUR_HUGGINGFACE_TOKEN" # set this to your own Hugging Face token

In [None]:
model_id = "mistralai/Mixtral-8x7B-v0.1"

# Read the token from the environment instead of hard-coding it a second time
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["HUGGINGFACE_TOKEN"])
#tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def token_len(text):
    tokens = tokenizer.encode(
        text
    )
    return len(tokens)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
text_splitter = TokenTextSplitter(
    chunk_size=450, # 500 tokens is the max
    chunk_overlap=20 # Overlap of N tokens between chunks (to reduce chance of cutting out relevant connected text like middle of sentence)
)

# chunks = text_splitter.split_text(light_pdf)

# for chunk in chunks:
#     print(token_len(chunk))
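To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch over a list of token IDs. It illustrates the general idea and is not LangChain's implementation; `split_tokens` is a hypothetical helper.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens, so text near a
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks

tokens = list(range(1000))  # stand-in for real token IDs
chunks = split_tokens(tokens, chunk_size=450, chunk_overlap=20)
print([len(c) for c in chunks])           # chunk lengths
print(chunks[0][-20:] == chunks[1][:20])  # do neighbours share the overlap?
```

A larger overlap lowers the chance of cutting a specification in half at a boundary, at the cost of storing more duplicated text.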

In [None]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")



In [None]:
def product_selection(client, model, user_question, relevant_excerpts):
    """Ask the Groq chat model for a product suggestion grounded in the retrieved excerpts."""
    chat_completion = client.chat.completions.create(
        messages = [
          # System message: sets the assistant's role
          {"role": "system", "content": """You are a product suggestion assistant. Use the provided product specifications and the
                  user's query to generate the most relevant product suggestions. Retrieve relevant documents from the knowledge
                  base first, and then use this information to inform your suggestion."""},
          # User message: retrieved excerpts followed by the user's request
          {"role": "user", "content": "Based on the following product specifications:  "+
            relevant_excerpts + ", what product would you suggest for a user looking for " + user_question}
        ],
        model = model
    )

    # Return only the text of the first completion
    response = chat_completion.choices[0].message.content
    return response
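It can help to inspect the prompt before spending an API call. The sketch below separates message construction from the request; `build_messages`, the shortened `SYSTEM_PROMPT`, and the `max_chars` cap are hypothetical choices for illustration, not part of the Groq client.

```python
# Shortened system prompt, used here only for illustration.
SYSTEM_PROMPT = (
    "You are a product suggestion assistant. Use the provided product "
    "specifications and the user's query to generate the most relevant "
    "product suggestions."
)

def build_messages(user_question, relevant_excerpts, max_chars=12000):
    """Assemble the chat messages without calling the API.

    max_chars is an arbitrary illustrative cap on the excerpt length;
    it is a character budget, not an exact token limit.
    """
    excerpts = relevant_excerpts[:max_chars]
    user_content = (
        "Based on the following product specifications:\n\n"
        f"{excerpts}\n\n"
        f"User request: {user_question}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

messages = build_messages("a fuse for automotive use", "category: fuses ...")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The resulting list has the same shape as the `messages` argument passed to `client.chat.completions.create` above.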



### Using a Vector DB to store and retrieve embeddings for all products

In [None]:
documents = []
# Split each non-empty datasheet into chunks and prefix every chunk with its category and link
for _, row in df[df['cleaned_text'].notnull()].iterrows():
    header = f"category: {row['target_col']}, link : {row['datasheet_link']} \n\n"
    for chunk in text_splitter.split_text(row.cleaned_text):
        documents.append(Document(page_content=header + chunk, metadata={"source": "local"}))

print(len(documents))

798


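Store the documents in a Pinecone index. The index named below must already exist, and its dimension must match the embedding model (384 for `all-MiniLM-L6-v2`).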

In [None]:
pinecone_index_name = "pdf-classification" # set this to your own index name
docsearch = PineconeVectorStore.from_documents(documents, embedding_function, index_name=pinecone_index_name)

# Alternatively, use Chroma as an open-source option
#docsearch = Chroma.from_documents(documents, embedding_function)
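`PineconeVectorStore.from_documents` looks the index up by name, so it generally needs to exist beforehand. The sketch below shows one way to create it on demand, assuming the Pinecone Python client v3 or later. The helper `ensure_index` is hypothetical, and the cloud and region in the usage comment are placeholder assumptions to adapt to your own account.

```python
def ensure_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index `name` on client `pc` if it is not already listed.

    `pc` is expected to behave like pinecone.Pinecone: list_indexes().names()
    returns the existing index names and create_index(...) creates a new one.
    Returns True when an index was created, False when it already existed.
    """
    if name in pc.list_indexes().names():
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True

# Possible usage with the real client (needs network access and a valid key):
# from pinecone import Pinecone, ServerlessSpec
# pc = Pinecone(api_key=pinecone_api_key)
# ensure_index(pc, "pdf-classification", dimension=384,
#              spec=ServerlessSpec(cloud="aws", region="us-east-1"))
```

The dimension of 384 matches the output size of `all-MiniLM-L6-v2`; a different embedding model would need a different value.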


In [None]:
user_question = "Suggest the best fuse products with their specifications, and explain why you suggest them."

In [None]:
relevant_docs = docsearch.similarity_search(user_question)
# print results
#display(HTML(relevant_docs[0].page_content))

In [None]:
relevant_excerpts = '\n\n------------------------------------------------------\n\n'.join([doc.page_content for doc in relevant_docs])
display(HTML(relevant_excerpts.replace("\n", "<br>")))

In [None]:
response = product_selection(client, model, user_question, relevant_excerpts)
display(HTML(response.replace("\n", "<br>")))
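Model output and PDF text can contain characters such as `<` or `&`, which `HTML(...)` would interpret as markup. A small, hypothetical helper like `to_html` below escapes the text before converting newlines, using only the standard library.

```python
import html

def to_html(text):
    """Escape HTML-special characters, then turn newlines into <br> tags."""
    return html.escape(text).replace("\n", "<br>")

sample = "Rated current < 30 A\nVoltage & interrupting rating"
print(to_html(sample))

# In the notebook this could be used as:
# display(HTML(to_html(response)))
```

Escaping first and replacing newlines second keeps the inserted `<br>` tags from being escaped themselves.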