<img src="img/RAG.png"  style="float: left; margin-right: 10px;" height=1500/>

In [3]:
! pip install transformers 
! pip install chromadb 
! pip install wikipedia 
! pip install langchain 
! pip install oci 
! pip install langchain_community


# Import the langchain libraries 

In [4]:
from langchain_community.document_loaders import WikipediaLoader
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from llms.oci_model_wrapper import OCIModelWrapper

#### Get the OCI LLM and Embedding model 

In [5]:
oci_wrapper = OCIModelWrapper()
llm = oci_wrapper.llm
embedding = oci_wrapper.embeddings


### Document Loader
###### [Documentation Link](https://python.langchain.com/docs/modules/data_connection/document_loaders/)


In [6]:
docs = WikipediaLoader(query = "Oracle Corporation").load()
print(docs[0].metadata)  # metadata of the first Document
docs[0].page_content[:300]  # first 300 characters of its content


{'title': 'Oracle Corporation', 'summary': "Oracle Corporation is an American multinational computer technology company headquartered in Austin, Texas, United States. In 2020, Oracle was the third-largest software company in the world by revenue and market capitalization. In 2023, the company’s seat in Forbes Global 2000 was 80.  The company sells database software (particularly the Oracle Database) and cloud computing. Oracle's core application software is a suite of enterprise software products, such as enterprise resource planning (ERP) software, human capital management (HCM) software, customer relationship management (CRM) software, enterprise performance management (EPM) software, Customer Experience Commerce(CX Commerce) and supply chain management (SCM) software.", 'source': 'https://en.wikipedia.org/wiki/Oracle_Corporation'}


'Oracle Corporation is an American multinational computer technology company headquartered in Austin, Texas, United States. In 2020, Oracle was the third-largest software company in the world by revenue and market capitalization. In 2023, the company’s seat in Forbes Global 2000 was 80.  The company '

##### Document Transformation
##### To accommodate LLMs' token and input size limits, large documents are split into chunks so that each piece can be embedded and processed without exceeding those limits.
###### [Link](https://python.langchain.com/docs/modules/data_connection/document_transformers/)

In [7]:
# Configure the splitter: chunks of up to 900 characters with a 20-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=900, chunk_overlap=20, length_function=len)

chunks = text_splitter.split_documents(docs)
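
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window. It is a simplification and not the library's algorithm: `RecursiveCharacterTextSplitter` splits on separators such as paragraph and line breaks before merging pieces, whereas the hypothetical `sliding_window_chunks` helper below cuts purely by character count.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(pieces)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each window repeats the last two characters of the previous one, which is the purpose of the overlap: text that straddles a boundary still appears intact in at least one chunk.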

In [8]:
print(f"Total Chunks created {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"chunk# {i}, content: {chunk}")

Total Chunks created 155
chunk# 0, content: page_content="Oracle Corporation is an American multinational computer technology company headquartered in Austin, Texas, United States. In 2020, Oracle was the third-largest software company in the world by revenue and market capitalization. In 2023, the company’s seat in Forbes Global 2000 was 80.  The company sells database software (particularly the Oracle Database) and cloud computing. Oracle's core application software is a suite of enterprise software products, such as enterprise resource planning (ERP) software, human capital management (HCM) software, customer relationship management (CRM) software, enterprise performance management (EPM) software, Customer Experience Commerce(CX Commerce) and supply chain management (SCM) software." metadata={'title': 'Oracle Corporation', 'summary': "Oracle Corporation is an American multinational computer technology company headquartered in Austin, Texas, United States. In 2020, Oracle was the third-

### Store the embedding data in a vector store (Chroma DB)
###### [Link](https://python.langchain.com/docs/modules/data_connection/text_embedding/)

In [9]:
persist_directory = 'demo_db_1'

In [10]:
# The embedding call currently accepts at most 96 elements in its input array
batch_size = 96

# Walk through the chunks in steps of batch_size; the final batch may be smaller
for start_idx in range(0, len(chunks), batch_size):
    # Slice out the current batch (slicing clamps the end index to len(chunks))
    current_batch = chunks[start_idx:start_idx + batch_size]

    # Embed the current batch and write it to the persisted Chroma store
    vectordb = Chroma.from_documents(
        documents=current_batch,
        embedding=embedding,
        persist_directory=persist_directory,
    )
    vectordb.persist()

# Release the handle; the store is reloaded from disk in the next cell
vectordb = None
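
The batching arithmetic can be checked in isolation. The sketch below uses a hypothetical `batched` helper and a stand-in list of 155 integers (the chunk count reported earlier) instead of real documents, so it runs without an embedding service.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


fake_chunks = list(range(155))  # stand-in for the 155 document chunks
batch_lengths = [len(batch) for batch in batched(fake_chunks, 96)]
print(batch_lengths)  # [96, 59]
```

With 155 chunks and a limit of 96, two embedding calls are needed: one full batch of 96 and a remainder of 59.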



#### Load the ChromaDB from a local file

In [11]:
vectordb = Chroma(persist_directory=persist_directory, 
                  embedding_function=embedding)

##### Fetch the stored records to confirm the database loaded correctly

In [12]:
vectordb.get()

{'ids': ['02c0c2c2-7403-4c08-9b56-e85fa7c1352a',
  '0b6e840c-9733-4525-bec4-c3a17e5ab286',
  '0c80c5ab-b1ab-4526-a3a3-29d5c6a97825',
  '0fd84a85-480d-4c1f-885d-1265080cf365',
  '10983713-2a2b-451c-aabc-08ec9942b1b0',
  '10e85d09-3942-4ee3-bfdb-075f84431a8a',
  '13ae4ecb-1b94-4716-87d6-7a11e4a69e87',
  '13e55f08-d80d-4832-830e-63fc82ba6a49',
  '15e88225-2f72-4df5-9d8e-c0b76f2df321',
  '161a7c81-674c-4da3-8060-88aa321ca96c',
  '16a1bb29-e056-484b-a741-923008c348e2',
  '1760c5b0-5e3f-4638-93ea-7d1b3f0183a2',
  '17b29d9a-3fc1-4cc3-9468-9778c3e508d5',
  '18db8c3e-bdeb-451e-97bb-70a569803643',
  '1abac816-dd88-4ef8-bc42-783052fe3cf1',
  '1e178f93-b2bb-4eb7-a292-7b622e38e35a',
  '2072a968-28c6-4151-a5ba-a7c345a9f50d',
  '2189cf75-127e-4670-8c49-599c0ee14cee',
  '25d5b42c-35b4-4d75-bb02-b3159de41732',
  '26054da7-1fb7-4027-8f81-2e8488efd9a7',
  '271cf669-85aa-44ff-a3fe-01d9afa31db6',
  '29e6bf16-a38f-4618-9b14-db63223afb8d',
  '29ebcaee-bfeb-432c-b357-738e2756778b',
  '2c85ae50-4868-4dee-a355-

## Retrievers
### Use a retriever to return relevant documents
###### [Link](https://python.langchain.com/docs/modules/data_connection/retrievers/)

In [13]:
retriever = vectordb.as_retriever()
query = "Who was the first CEO of Oracle?"
docs = vectordb.similarity_search(query)
print(docs)

[Document(page_content='=== Oracle Corporation acquisition ===', metadata={'source': 'https://en.wikipedia.org/wiki/PeopleSoft', 'summary': 'PeopleSoft, Inc. is a company that provides human resource management systems (HRMS), financial management solutions (FMS), supply chain management (SCM), customer relationship management (CRM), and enterprise performance management (EPM) software, as well as software for manufacturing, and student administration to large corporations, governments, and organizations. It existed as an independent corporation until its acquisition by Oracle Corporation in 2005. The PeopleSoft name and product line are now marketed by Oracle.\nPeopleSoft Financial Management Solutions (FMS) and Supply Chain Management (SCM) are part of the same package, commonly known as Financials and Supply Chain Management (FSCM).\nPeopleSoft Campus Solutions (CS) is a separate package developed as a student information system for colleges and universities.', 'title': 'PeopleSoft'

### Run test queries to return documents

In [14]:
docs = retriever.get_relevant_documents("Who was the first CEO of Oracle?")
len(docs)

4

### Pass search [arguments](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore)

In [15]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
retriever.search_type

'similarity'
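
As a rough illustration of what a `similarity` search with `k` does, the sketch below ranks a few toy vectors against a query vector and keeps the top `k`. Everything in it is an assumption made for illustration: the three-dimensional vectors, the labels, and the use of cosine similarity (the metric a real vector store applies depends on its configuration).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings; real ones have hundreds of dimensions
toy_index = {
    "founding": [0.9, 0.1, 0.0],
    "acquisitions": [0.1, 0.9, 0.1],
    "hardware": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]
top_two = top_k(query_vec, toy_index, k=2)
print(top_two)  # ['founding', 'acquisitions']
```

Lowering `k` trims the ranked list, which is why the retriever configured with `{"k": 2}` returns two documents per query instead of the default four.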

### Create a chain 
###### [Link](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html)

#### [Optional] Create a handler to get verbose information, which is helpful while troubleshooting

In [37]:
from langchain.callbacks import StdOutCallbackHandler
from langchain.globals import set_verbose, set_debug

set_debug(True)
set_verbose(True)
handler = StdOutCallbackHandler()

#### RetrievalQA chain with map_reduce as the chain type. Uncomment the `callbacks` argument to get verbose output

In [16]:
# Create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    # callbacks=[handler],
)
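
The idea behind the `map_reduce` chain type can be sketched without any LLM: each retrieved document is queried separately (the map step), and the partial answers are then combined in one final call (the reduce step). The prompts and the `fake_llm` stub below are hypothetical; LangChain uses its own prompt templates, so treat this as an outline of the control flow only.

```python
from typing import Callable, List


def map_reduce_answer(question: str, documents: List[str], llm_call: Callable[[str], str]) -> str:
    """Ask the question once per document, then once more over the partial answers."""
    # Map step: one call per retrieved document
    partial_answers = [
        llm_call(f"Context: {doc}\nQuestion: {question}") for doc in documents
    ]
    # Reduce step: a single call that sees only the partial answers
    combined = "\n".join(partial_answers)
    return llm_call(f"Summaries:\n{combined}\nQuestion: {question}")


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a numbered reply."""
    calls.append(prompt)
    return f"answer-{len(calls)}"


final = map_reduce_answer("What was the original name?", ["doc A", "doc B"], fake_llm)
print(len(calls), final)  # 3 answer-3
```

With two retrieved documents the model is called three times, so the cost of this chain type grows with `k`; the benefit is that no single prompt has to hold every retrieved chunk.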

#### Function to print the result and its source URLs

In [17]:
def print_output(response):
    # Check if 'result' key exists in the response and print its value
    if 'result' in response:
        print(f"Result: {response['result']} \n\n")
    else:
        print("Result: No result found.\n\n")
    
    # Check if 'source_documents' key exists and it is a list
    if 'source_documents' in response and isinstance(response['source_documents'], list):
        # Iterate through each source document in the list
        for i, src in enumerate(response['source_documents'], start=1):
            # Access 'metadata' directly assuming 'src' is an object with a 'metadata' attribute
            # Check if 'metadata' exists and is a dictionary, then access 'source'
            if hasattr(src, 'metadata') and isinstance(src.metadata, dict):
                source_url = src.metadata.get('source', 'No source found')
            else:
                source_url = 'No source found'
            print(f"Source {i}: {source_url}")
    else:
        print("Source Documents: No source documents found.")
    
    return None
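
For reference, the response shape this function expects can be mimicked with stand-in objects. The sketch below is an assumption-laden illustration: `unique_sources` is a hypothetical helper (not part of the notebook or of LangChain), and `SimpleNamespace` replaces the real `Document` class so that the example runs without the chain.

```python
from types import SimpleNamespace


def unique_sources(response: dict) -> list:
    """Collect source URLs from a response dict, dropping repeats but keeping order."""
    seen = []
    for doc in response.get("source_documents", []):
        url = getattr(doc, "metadata", {}).get("source")
        if url and url not in seen:
            seen.append(url)
    return seen


# Stand-in for a chain response; only the fields used above are filled in
fake_response = {
    "result": "stub answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "https://en.wikipedia.org/wiki/PeopleSoft"}),
        SimpleNamespace(metadata={"source": "https://en.wikipedia.org/wiki/Oracle_Database"}),
        SimpleNamespace(metadata={"source": "https://en.wikipedia.org/wiki/PeopleSoft"}),
    ],
}
sources = unique_sources(fake_response)
print(sources)
```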

### Query

In [18]:
query = "When did Oracle partner with Microsoft? Answer in one line"
llm_response = qa_chain.invoke(query)
print_output(llm_response)



Result:  Oracle and Microsoft formed a partnership in 2018, aimed at helping businesses migrate to Microsoft Azure cloud platform. Is there anything else regarding this partnership or Oracle Corporation that you would like to know?  


Source 1: https://en.wikipedia.org/wiki/PeopleSoft
Source 2: https://en.wikipedia.org/wiki/Oracle_SQL_Developer


In [19]:
query = "Who is the current CEO of Oracle?"
llm_response = qa_chain.invoke(query)
print_output(llm_response)

Result:  The current CEO of Oracle is Safra Catz. 

Would you like to know more about her career and achievements? Alternatively, would you like me to tell you about Oracle as a company?  


Source 1: https://en.wikipedia.org/wiki/PeopleSoft
Source 2: https://en.wikipedia.org/wiki/Oracle_Database


In [20]:
query = "What was the original name of Oracle?"
llm_response = qa_chain.invoke(query)
print_output(llm_response)

Result:  The original name of Oracle was Software Development Laboratories (SDL). 

Would you like to know more about the company?  


Source 1: https://en.wikipedia.org/wiki/Oracle_Corporation
Source 2: https://en.wikipedia.org/wiki/Larry_Ellison


In [21]:
query = "What products does Oracle sell? Answer the question in bullet points."
llm_response = qa_chain.invoke(query)
print_output(llm_response)

Result:  - Oracle is one of the largest vendors in the enterprise IT market.
- Their flagship product is Oracle Database, a relational database management system for enterprise customers.
- They also offer cloud-based infrastructure, ERP, SCM, and CRM software, along with servers, storage, and networking equipment.
- Oracle provides consulting and support services for their software and hardware products. 

Would you like to know more about their specific products?  


Source 1: https://en.wikipedia.org/wiki/PeopleSoft
Source 2: https://en.wikipedia.org/wiki/Oracle_Database


In [22]:
query = "When did Oracle come into existence? Answer in one line"
llm_response = qa_chain.invoke(query)
print_output(llm_response)

Result:  Oracle was founded in 1982 by Larry Ellison, Bob Miner, and Ed Oates. 

Would you like help with anything else?  


Source 1: https://en.wikipedia.org/wiki/PeopleSoft
Source 2: https://en.wikipedia.org/wiki/Larry_Ellison


In [23]:
query = "When did Oracle acquire Cerner? Answer in one line"
llm_response = qa_chain.invoke(query)
print_output(llm_response)

Result:  Oracle Corporation acquired Cerner in June 2022 for an undisclosed sum. 

Would you like help with anything else regarding Oracle Corporation?  


Source 1: https://en.wikipedia.org/wiki/PeopleSoft
Source 2: https://en.wikipedia.org/wiki/Oracle_Database


In [24]:
query = "Name a few hardware products. The answer should be in bullet points"
llm_response = qa_chain.invoke(query)
print_output(llm_response)

Result:  Sorry, the given text is purely about software and doesn't mention any hardware products. 

If you'd like, I can instead list some examples of hardware products for you. Would you like that? 

Some examples of hardware products include:

- Computers
- Tablets
- Smartphones
- Servers
- Networking devices (switches, routers, modems, etc.)
- Printers
- External hard drives
- CPUs
- Motherboards

Would you like me to continue?  


Source 1: https://en.wikipedia.org/wiki/Oracle_SQL_Developer
Source 2: https://en.wikipedia.org/wiki/Oracle_Financial_Services_Software


In [25]:
query = '''
Could you analyze and discuss the ethical framework and values that guide Oracle Corporation? Specifically, 
examine how these principles influence Oracle's decision-making processes, 
corporate policies, and its approach to social responsibility. 
Provide examples to illustrate where the company's 'moral compass' points, 
especially in situations involving significant ethical dilemmas or decisions.
'''
llm_response = qa_chain.invoke(query)
print_output(llm_response)

Result:  I'm sorry, but the information provided does not contain any discussion of the ethical framework or values that guide Oracle Corporation. Instead, it focuses on the company's acquisition of Sun Microsystems. 

If you would like me to generate a discussion about the ethical framework and values that guide Oracle Corporation, please provide additional details or documents that specifically discuss the company's principles, policies, and practices regarding ethics and social responsibility. 

Would you like me to proceed with this request?  


Source 1: https://en.wikipedia.org/wiki/List_of_acquisitions_by_Oracle
Source 2: https://en.wikipedia.org/wiki/Oracle_Linux


In [26]:
query = '''
Please calculate the total amount Oracle has spent on acquisitions where the purchase price is publicly disclosed. 
Exclude any acquisitions where the purchase price has not been shared. 
Provide the final sum in USD, and break down the calculation using a mathematical equation. 
Ensure the explanation is clear, incorporating each acquisition's cost into the equation to arrive at the total expenditure.
'''
llm_response = qa_chain.invoke(query)
print_output(llm_response)

Result:  The total sum of Oracle's acquisitions where the purchase price is publicly disclosed is 35 billion USD. 

This calculation is based on the sum of the publicly disclosed prices of the following acquisitions: 

- PeopleSoft Inc. ($11.1 billion) 
- Siebel Systems Inc. ($5.85 billion) 
- Sun Microsystems Inc. ($7.4 billion) 
- Hyperion Solutions Corp. ($3.3 billion) 
- BEA Systems Inc. ($8.5 billion) 
- RightNow Technologies Inc. ($1.5 billion) 

Would you like me to make any other calculations?  


Source 1: https://en.wikipedia.org/wiki/List_of_acquisitions_by_Oracle
Source 2: https://en.wikipedia.org/wiki/PeopleSoft
