<center><h1>Retrieval-Augmented Generation (RAG) Pipeline for Market Research and Analytics</h1></center>

We intend to deploy a RAG-based AI system to streamline market research and data analysis. The system uses __'Llama-2-7B'__ as the underlying LLM and __'HuggingFace Sentence Transformers'__ as the embedding model. It generates text grounded in retrieved documents and cites its sources. This makes responses more trustworthy and verifiable, and it reduces hallucinations, which are a major concern with Large Language Models.

> #### __Creating the virtual environment and installing dependencies__

Creating a virtual environment for every new project is important because it avoids clashes between the dependencies of different software packages. Next, we install the packages and dependencies listed in the text file `requirements.txt`, which sits in the project directory.

In [5]:
# new virtual environment named 'RAG' (Windows activation path shown below)
#!python -m venv RAG    # uncomment code if needed
#!RAG\Scripts\activate  # uncomment code if needed

# install dependencies
#!pip install -r requirements.txt   # uncomment code if needed
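
Before installing anything, it can help to confirm that the notebook kernel is really running inside the virtual environment. The following is a minimal sketch, assuming the environment was created with the standard `venv` module; the helper name `in_virtualenv` is hypothetical and not part of the original notebook.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

If this prints `False`, the `pip install` above would modify the base interpreter rather than the project environment.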

> #### __The LLM : Llama2-7B-chat.Q6_K__

* Open-source model, developed and publicly released by Meta as part of the Llama 2 family of LLMs
* Downloads last month : 962,379
* Collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters
* Use governed by the Meta license
* The chat variant is optimized for dialogue use cases; according to Meta, it outperforms open-source chat models on most benchmarks and in human evaluations of helpfulness and safety
* Model architecture – an auto-regressive language model that uses an optimized transformer architecture; the tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety
* Training data – a new mix of publicly available online data totalling 2 trillion tokens; the fine-tuning data includes publicly available instruction datasets as well as over one million new human-annotated examples; neither the pretraining nor the fine-tuning datasets include Meta user data
* Data freshness – pretraining cutoff date is September 2022; some tuning data is more recent, up to July 2023
* Parameters – 7B
* Context length – 4k tokens
* Training duration – Jan 2023 to July 2023
* Original paper : Llama2 : Open Foundation and Fine-tuned Chat Models. [Click here to read the paper.](https://arxiv.org/abs/2307.09288)

We run the model locally using `llama-cpp-python`, the Python bindings for the C++ library `llama.cpp`.

Here is the link to the model card : [Llama2-7B-chat.Q6_K](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF)

In [1]:
from llama_index.llms import LlamaCPP

model_path = r"C:\0-VARAD-DESHMUKH\models\llama-2-7b-chat.Q6_K.gguf"

llm = LlamaCPP(
    model_path=model_path,
    temperature=0,
    # kwargs to pass to __call__()
    generate_kwargs={},
    verbose=True,
)

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | 
Model metadata: {'general.name': 'LLaMA v2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '11008', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '18', 'llama.attention.head_count_kv': '32', 'llama.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0'}
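
The metadata above reports `llama.context_length` as 4096, so the prompt template, the retrieved chunks, the question and the generated answer all have to fit inside that window. The sketch below is a rough budgeting aid under stated assumptions: the function name and every token count in the example are hypothetical, and real counts depend on the tokenizer and on how the index chunks the document.

```python
CONTEXT_LENGTH = 4096  # 'llama.context_length' reported in the model metadata


def remaining_tokens(chunk_tokens, question_tokens, template_tokens, max_new_tokens):
    """Tokens left in the context window after reserving space for every part."""
    used = sum(chunk_tokens) + question_tokens + template_tokens + max_new_tokens
    return CONTEXT_LENGTH - used


# Hypothetical sizes: two retrieved chunks of about 1024 tokens each.
left = remaining_tokens(
    chunk_tokens=[1024, 1024],
    question_tokens=40,
    template_tokens=120,
    max_new_tokens=256,
)
print(left)  # 4096 - (2048 + 40 + 120 + 256) = 1632
```

A negative result would mean the request cannot fit and that fewer or smaller chunks should be retrieved.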


> #### __The Embedding Model : BAAI/bge-large-en-v1.5__

Open-source embedding model, developed by the Beijing Academy of Artificial Intelligence (BAAI). We download the model from the HuggingFace Model Hub.

Here is the link to the model card : [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)

In [2]:
# sentence transformers
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

  from .autonotebook import tqdm as notebook_tqdm
config.json: 100%|██████████| 779/779 [00:00<?, ?B/s] 
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
model.safetensors: 100%|██████████| 1.34G/1.34G [01:00<00:00, 22.2MB/s]
tokenizer_config.json: 100%|██████████| 366/366 [00:00<?, ?B/s] 
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 633kB/s]
tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 949kB/s]
special_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 137kB/s]
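
An embedding model maps each piece of text to a vector, and semantic search compares those vectors, commonly with cosine similarity. The sketch below illustrates that comparison with tiny hand-written vectors; the three-dimensional values are made up for illustration, whereas real embeddings have far more dimensions and come from the model itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: one document points roughly the same way as the query,
# the other points in a very different direction.
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]
doc_far = [0.0, 0.1, 0.9]

print(cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far))  # True
```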


> #### __Define Service Context__

We need to configure the system by specifying the LLM and the embedding model.

In [3]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model
)

> #### __Load the Data__

We first load the `PyMuPDFReader` loader from `LlamaHub`, specify the path to our PDF document, and construct the list of Document objects.

In [4]:
# optional: make sure the static directory exists
import os
static_dir = 'static/'
os.makedirs(static_dir, exist_ok=True)

# loading the PDFReader from llama_index hub
from llama_index import download_loader

# PyMuPDFReader
PyMuPDFReader = download_loader('PyMuPDFReader')
loader = PyMuPDFReader()

pdf_path = r"C:\0-VARAD-DESHMUKH\Files\data\Report.pdf"

documents = loader.load_data(
    file_path=pdf_path,
    metadata=True
)

> #### __Local Vector Store__

After we load the documents, we set up a local vector index store. It holds the vector embeddings of the source documents; at query time, the query is embedded with the same model and compared against them for efficient semantic search.

In [7]:
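from llama_index import VectorStoreIndex

# directory where the index is persisted (same path as in the cell below)
PERSIST_DIR = r"C:\0-VARAD-DESHMUKH\storage"

# build the index: parse the documents into nodes and embed each node
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,
    service_context=service_context
)
# store it for later
index.storage_context.persist(persist_dir=PERSIST_DIR)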

Parsing nodes: 100%|██████████| 27/27 [00:00<00:00, 1265.96it/s]
Generating embeddings: 100%|██████████| 27/27 [01:13<00:00,  2.72s/it]
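
To make the role of the vector store concrete, here is a toy stand-in that keeps (text, vector) pairs, ranks them by cosine similarity and can be saved to disk and reloaded. It is only a sketch of the idea: `ToyVectorStore`, its methods, the JSON file format and the two-dimensional vectors are all hypothetical and do not reflect how LlamaIndex stores its index.

```python
import json
import math
import os
import tempfile


class ToyVectorStore:
    """Hypothetical in-memory store that ranks texts by cosine similarity."""

    def __init__(self):
        self.items = []  # each item: {"text": str, "vector": list of floats}

    def add(self, text, vector):
        self.items.append({"text": text, "vector": list(vector)})

    def query(self, vector, top_k=2):
        """Return the top_k stored texts most similar to the query vector."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(y * y for y in b))
            return dot / (norm_a * norm_b)

        ranked = sorted(
            self.items,
            key=lambda item: cosine(vector, item["vector"]),
            reverse=True,
        )
        return [item["text"] for item in ranked[:top_k]]

    def persist(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.items, f)

    @classmethod
    def load(cls, path):
        store = cls()
        with open(path, encoding="utf-8") as f:
            store.items = json.load(f)
        return store


store = ToyVectorStore()
store.add("fine-tuning costs", [1.0, 0.0])
store.add("retrieval approach", [0.0, 1.0])
store.add("hybrid notes", [0.7, 0.7])

# Persist to a temporary file, reload, and query the reloaded copy.
path = os.path.join(tempfile.mkdtemp(), "toy_store.json")
store.persist(path)
reloaded = ToyVectorStore.load(path)
print(reloaded.query([0.1, 0.9], top_k=2))  # ['retrieval approach', 'hybrid notes']
```

Persisting once and reloading later serves the same purpose as `PERSIST_DIR` in the notebook: the embeddings, which took over a minute to generate above, do not have to be recomputed on every run.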


In [5]:
import os.path
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
    get_response_synthesizer
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

document_directory = r"C:\0-VARAD-DESHMUKH\Files\data"

# check if storage already exists
PERSIST_DIR = r"C:\0-VARAD-DESHMUKH\storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader(input_dir=document_directory).load_data()
    index = VectorStoreIndex.from_documents(
        documents,
        show_progress=True,
        service_context=service_context
    )
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index; pass service_context explicitly, otherwise
    # llama_index falls back to its default OpenAI LLM and raises
    # "ValueError: Could not load OpenAI model ... No API key found for OpenAI."
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(
        storage_context,
        service_context=service_context
    )

In [8]:

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
    service_context=service_context
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    service_context=service_context
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
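
The response synthesizer above uses the `tree_summarize` mode. The general idea behind hierarchical summarization can be sketched as follows: groups of chunks are summarized, then the summaries are summarized again, until one text remains. This is an illustration of the concept only and not LlamaIndex's implementation; `toy_summarize` is a hypothetical stand-in for an LLM call, and `fan_in` is an assumed parameter name.

```python
def tree_summarize(chunks, summarize, fan_in=2):
    """Combine chunks bottom-up, fan_in at a time, until one text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        # Summarize each group of up to fan_in neighbouring texts.
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def toy_summarize(texts):
    # Stand-in for an LLM call: bracket the inputs so the tree shape is visible.
    return "[" + " | ".join(texts) + "]"


result = tree_summarize(["alpha one", "beta two", "gamma three"], toy_summarize)
print(result)  # [[alpha one | beta two] | [gamma three]]
```

The nested brackets show the two levels of the tree: the first two chunks are combined, the third is carried up alone, and a final step merges both.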


> #### __The Retrieval Pipeline__

We now build the retrieval pipeline. The query is written as a prompt and converted into an embedding with the same embedding model; the vector index is then searched for the nodes of the source document that are most similar to it. Only these retrieved nodes are passed to the LLM, which answers the user query based on that material.
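
The last step, handing the retrieved nodes to the LLM together with the question, amounts to assembling a single prompt. The sketch below shows one way to do that; the function `build_prompt` and the template wording are hypothetical and are not the prompt that LlamaIndex uses internally.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble one prompt from numbered context chunks and the question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Answer the question using only the context above.\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt_text = build_prompt(
    "What model alternatives are being considered?",
    [
        "fine-tuning a model is not a feasible option",
        "we look at the other, widely used alternative, retrieval-based approaches",
    ],
)
print(prompt_text)
```

Numbering the chunks makes it possible to ask the model to refer to them by index, which is one simple route to source citations.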

In [13]:
# query

prompt = '''
What model alternatives are being considered? Answer strictly according to the source document. Keep the tone technical. Answer strictly in 100 words.'''

response = query_engine.query(prompt)
print(response)

Llama.generate: prefix-match hit



The alternative models being considered for fine-tuning are BERT, RoBERTa, and DistilBERT. These models have shown promising results in various NLP tasks and are widely used in the field. However, due to their complexity and computational requirements, they are not feasible for our specific use case.


In [14]:
print(response.source_nodes[0].get_content())

9 | R A G - b a s e d  A I  s o l u t i o n  f o r  I n v e s t m e n t  R e s e a r c h  a n d  A n a l y t i c s  
 
So, our take is that for our specific use case, fine-tuning a model is not a feasible option. We need 
something which is cost-efficient and saves time and resources. So, we look at the other, widely used 
alternative, retrieval-based approaches. We take this up next.
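
Printing a single source node works for a spot check; for citations it is handy to list every retrieved node with its rank and similarity score. The sketch below assumes each node exposes `score`, `metadata` and `get_content()`; attribute names on real LlamaIndex objects may differ between versions. `StubSourceNode`, `format_citations`, the score and the metadata are hypothetical stand-ins so that the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class StubSourceNode:
    """Hypothetical stand-in for a retrieved source node."""
    text: str
    score: float
    metadata: dict = field(default_factory=dict)

    def get_content(self):
        return self.text


def format_citations(source_nodes, preview_chars=60):
    """One line per node: rank, score, metadata and a short text preview."""
    lines = []
    for rank, node in enumerate(source_nodes, start=1):
        # Collapse whitespace so multi-line PDF text fits on one line.
        preview = " ".join(node.get_content().split())[:preview_chars]
        lines.append(
            f"[{rank}] score={node.score:.2f} metadata={node.metadata} text={preview!r}"
        )
    return "\n".join(lines)


nodes = [
    StubSourceNode(
        text="So, our take is that for our specific use case,\nfine-tuning a model is not a feasible option.",
        score=0.83,              # hypothetical similarity score
        metadata={"source": "9"},  # hypothetical page metadata
    )
]
citations = format_citations(nodes)
print(citations)
```

With the notebook's objects, the same helper could be tried on `response.source_nodes`, provided those nodes expose the attributes assumed here.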
