<a target="_blank" href="https://github.com/castillosebastian/genai0/blob/main/exp/RAG_toy.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# RAG_toy

**Runs over a tiny 'FinanceBench' document set:**

- [MICROSOFT_2023_10K](../data/financebench/MICROSOFT_2023_10K.pdf)
- [JOHNSON&JOHNSON_2022Q4_EARNINGS](../data/financebench/JOHNSON&JOHNSON_2022Q4_EARNINGS.pdf)
- [Pfizer_2023Q2_10Q](../data/financebench/Pfizer_2023Q2_10Q.pdf)
- [BESTBUY_2017_10K](../data/financebench/BESTBUY_2017_10K.pdf)
- [BESTBUY_2019_10K](../data/financebench/BESTBUY_2019_10K.pdf)

### Fill in your keys:

In [1]:
# Replace the placeholders with your own credentials; never commit real keys.
OPENAI_API_TYPE='azure'
OPENAI_API_VERSION='2023-05-15'
AZURE_OPENAI_ENDPOINT='https://usesharedaopenai001.openai.azure.com/'
OPENAI_API_KEY='<your-azure-openai-api-key>'
OPENAI_EMBEDDINGS_MODEL_NAME='text-embedding-ada-002'
SEARCH_SERVICE_ENDPOINT='https://genai0.search.windows.net'
SEARCH_SERVICE_API_KEY='<your-azure-ai-search-api-key>'
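
Hard-coding credentials in a notebook cell makes them easy to leak. As a minimal sketch of one alternative, the settings could be read from environment variables instead. The helper name `load_settings` and the `REQUIRED_SETTINGS` list are hypothetical and not part of this notebook; the setting names simply mirror the variables in the cell above.

```python
import os

# Names of the settings this notebook expects (they mirror the cell above).
REQUIRED_SETTINGS = [
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_KEY",
    "OPENAI_EMBEDDINGS_MODEL_NAME",
    "SEARCH_SERVICE_ENDPOINT",
    "SEARCH_SERVICE_API_KEY",
]


def load_settings(env=None):
    """Return the required settings as a dict, failing fast if any is missing.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

With the variables exported in the shell that starts Jupyter, `settings = load_settings()` followed by `OPENAI_API_KEY = settings["OPENAI_API_KEY"]` (and likewise for the other names) would stand in for the literal assignments.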

# Here comes the magic!

In [2]:
import os
import sys
import pandas as pd
import openai
import json
import wget
from openai import AzureOpenAI
from langchain.embeddings import AzureOpenAIEmbeddings
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from azure.core.credentials import AzureKeyCredential 
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryCaptionResult,
    QueryAnswerResult,
    SemanticErrorMode,
    SemanticErrorReason,
    SemanticSearchResultsType,
    QueryType,
    VectorizedQuery,
    VectorQuery,
    VectorFilterMode,    
)
from langchain.vectorstores.azuresearch import AzureSearch
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

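# Variables-------------------------------------------------------
index                               = "azure-cognitive-search-vector-demo"
# LangChain's Azure wrappers read their configuration from environment variables
os.environ["OPENAI_API_TYPE"]       = OPENAI_API_TYPE
os.environ["OPENAI_API_VERSION"]    = OPENAI_API_VERSION
os.environ["AZURE_OPENAI_ENDPOINT"] = AZURE_OPENAI_ENDPOINT
os.environ["OPENAI_API_KEY"]        = OPENAI_API_KEY
azure_search_endpoint               = SEARCH_SERVICE_ENDPOINT
MODEL                               = "gtp35turbo-latest"       # chat deployment name
service_endpoint                    = SEARCH_SERVICE_ENDPOINT
index_name                          = index
key                                 = SEARCH_SERVICE_API_KEY
model                               = "text-embedding-ada-002"  # embeddings deployment name
credential                          = AzureKeyCredential(key)
COMPLETION_TOKENS                   = 1000
top_search_vector_k                 = 5                         # number of chunks to retrieve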

# Some helper functions---------------------------------------------
def OpenAIembeddings():
    open_ai_embeddings = AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-ada-002",
        openai_api_version=OPENAI_API_VERSION,
        chunk_size=1000,
    )
    return open_ai_embeddings

client = AzureOpenAI(
  api_key = OPENAI_API_KEY,  
  api_version = OPENAI_API_VERSION,
  azure_endpoint = AZURE_OPENAI_ENDPOINT
)

def generate_embeddings(text, model=model):
    """Return the embedding vector for `text` from the Azure OpenAI deployment `model`."""
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Retriever------------------------------------------------------------
from azure.search.documents import SearchClient
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)

# Ask your question

In [6]:
### Question:
# the question to ask - enable just one
QUESTION = "What is the revenue of Pfizer" 

In [7]:
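# Embed the question and ask the index for its nearest chunks in the "contentVector" field
vector_query = VectorizedQuery(vector=generate_embeddings(QUESTION), k_nearest_neighbors=top_search_vector_k, fields="contentVector")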
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    include_total_count=True,
    select=["id", "company_name", "source", "doc_type", "page_content"],
)  
#results.get_count()
# Format Azure AI Search results as an ordered dictionary
from collections import OrderedDict
ordered_results = OrderedDict()
for result in results:    
    ordered_results[result['id']]={
        "score": result['@search.score'],
        "company_name": result['company_name'],
        "source": result['source'],
        "doc_type": result['doc_type'],
        "page_content": result['page_content']
    }
# From the ordered dictionary, build the "Documents" array, needed by langchain "qa_with_sources" chain
top_docs = []
# Use doc_id as the loop variable so the search API key stored in `key` is not overwritten
for doc_id, value in ordered_results.items():
    references = f'Company: {value["company_name"]}, SEC report: {value["doc_type"]} \n (reference doc:"{value["source"]}", reference_score:{value["score"]} - id:{doc_id}).' 
    top_docs.append(Document(page_content=value["page_content"], metadata={"source": references}))
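
To make the retrieval step above more concrete, here is a toy sketch of what a k-nearest-neighbours vector query computes. It is a brute-force version that assumes cosine similarity as the metric; the real service may use an approximate search and a metric set in the index configuration. The function names, the 2-dimensional vectors, and the document ids are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=5):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in indexed_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; text-embedding-ada-002 vectors have 1536 dimensions.
toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(top_k([1.0, 0.1], toy_index, k=2))
```

In the notebook the same roles are played by `generate_embeddings(QUESTION)` (the query vector), the `contentVector` field (the indexed vectors), and `k_nearest_neighbors` (the value of `k`).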

In [10]:
# Prepare the Azure OpenAI chat deployment and the question-answering chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

llm = AzureChatOpenAI(model_name=MODEL)  # AzureChatOpenAI is imported in the setup cell
chain = load_qa_with_sources_chain(llm, chain_type='stuff')

response = chain({"input_documents": top_docs, "question": QUESTION, "language": "English"})

# Print the final result, including the citation(s)
from IPython.display import display, HTML, Markdown
display(HTML("<br/><br/><b>RAG_toybot final answer:</b>"))
display(Markdown(response['output_text']))

The revenue of Pfizer for the three months ended July 2, 2023, was $12,734 million, and for the six months ended July 2, 2023, it was $31,015 million. These figures are for the Global Biopharmaceuticals Business (Biopharma) segment. Specific revenue breakdowns for major products include $1,488 million for Comirnaty direct sales and alliance revenues, $143 million for Paxlovid, $1,762 million for Eliquis alliance revenues and direct sales, and $1,388 million for the Prevnar family. However, it should be noted that these figures are subject to change and are based on forecasts and estimates. The full revenue details can be found in Pfizer's SEC report: 10Q.
SOURCES: Company: PFIZER, SEC report: 10Q (reference doc:"../../data/financebench/Pfizer_2023Q2_10Q.pdf", reference_score:0.88448966 - id:703)
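
The `chain_type='stuff'` option used above places all retrieved chunks into a single prompt. The sketch below illustrates that idea in plain Python, under the assumption that each chunk is rendered with its source and the chunks are joined before the question. The function `stuff_prompt`, the prompt wording, and the example chunks are hypothetical; this is not LangChain's actual template.

```python
def stuff_prompt(question, docs):
    """Build one prompt that contains every retrieved chunk plus its source.

    `docs` is a list of (page_content, source) pairs. Hypothetical helper:
    the wording below is illustrative, not LangChain's own prompt.
    """
    context = "\n\n".join(
        f"Content: {page_content}\nSource: {source}"
        for page_content, source in docs
    )
    return (
        "Answer the question using only the extracts below, "
        "and list the sources you used after 'SOURCES:'.\n\n"
        f"{context}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )


# Placeholder chunks standing in for the `top_docs` built earlier.
example_docs = [
    ("Example chunk one.", "doc-1"),
    ("Example chunk two.", "doc-2"),
]
print(stuff_prompt("What is the revenue of Pfizer", example_docs))
```

Because every chunk goes into one request, this approach only works while the combined text fits in the model's context window, which is why the retrieval step keeps just a handful of chunks.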