# Hackathon 2023

In this notebook, we use LangChain to load annual report documents, split them into chunks, generate OpenAI embeddings, and insert them into an Azure Cognitive Search vector store so that we can search for and retrieve the most relevant passages. The resulting index can then be used in Retrieval-Augmented Generation (RAG) applications to retrieve the context needed for chatting with LLMs over our own data.

In [None]:
! pip install openai
! pip install azure-search-documents --pre
! pip install azure-identity
! pip install langchain
! pip install pypdf
! pip install python-dotenv


## Import required libraries and environment variables

In [7]:
# Import required libraries  
import openai
import os  
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.azuresearch import AzureSearch
from azure.search.documents.indexes.models import (
    SemanticSettings,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField
)

## Configure OpenAI Settings

In [8]:
# Configure environment variables  
load_dotenv()  
openai.api_type: str = "azure"  
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")  
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")  
openai.api_version = os.getenv("AZURE_OPENAI_API_VERSION")  
model: str = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL") 

## Configure Vector Store Settings

In [9]:
vector_store_address: str = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")  
vector_store_password: str = os.getenv("AZURE_SEARCH_ADMIN_KEY") 
index_name: str = "hackathon-2023-index"
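
Because `os.getenv` returns `None` for a missing variable, a typo in the `.env` file only surfaces later as an authentication or connection error. A minimal sketch of an early check, assuming the variable names used in the cells above (the helper `missing_env_vars` is hypothetical, not part of any library):

```python
import os

# Variable names taken from the configuration cells above.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` makes it easier to tell a configuration problem apart from a service problem.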

## Create embeddings and vector store instances
Create the OpenAI embeddings client and the Azure Cognitive Search vector store that will hold the embedded documents:

In [11]:
# chunk_size here is the embedding batch size (texts per request), not a text-splitting size
embeddings: OpenAIEmbeddings = OpenAIEmbeddings(
    deployment=model,
    model=model,
    chunk_size=1,
    openai_api_base=os.getenv("AZURE_OPENAI_ENDPOINT"),
    openai_api_type="azure",
)
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
    semantic_configuration_name='config',
    semantic_settings=SemanticSettings(
        default_configuration='config',
        configurations=[
            SemanticConfiguration(
                name='config',
                prioritized_fields=PrioritizedFields(
                    title_field=SemanticField(field_name='content'),
                    prioritized_content_fields=[SemanticField(field_name='content')],
                    prioritized_keywords_fields=[SemanticField(field_name='metadata')],
                ),
            )
        ],
    ),
)
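
For intuition about what the vector store does with these embeddings: each chunk and each query becomes a vector, and retrieval favors chunks whose vectors point in a similar direction to the query's. The toy sketch below ranks made-up 3-dimensional vectors by cosine similarity. The vectors, the chunk labels, and the helper `cosine_similarity` are illustrative assumptions, and the service's own indexing and ranking are more involved than this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings; real ones come from the embedding model.
chunk_vectors = {
    "goodwill impairment": [0.9, 0.1, 0.0],
    "employee headcount": [0.0, 0.2, 0.9],
    "deferred tax assets": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

# Sort chunk labels from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```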

In [None]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader

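# Load every PDF under data/ (one Document per page)
loader = PyPDFDirectoryLoader("data/")
documents = loader.load()

# Split the pages into chunks of up to about 1000 characters, with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Embed the chunks and upload them to the Azure Cognitive Search index
vector_store.add_documents(documents=docs)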
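
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window splitter in plain Python. It is only a sketch: `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ, and the helper `split_fixed` is hypothetical.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as in the cell above, text that straddles a boundary is cut between two chunks; a nonzero overlap repeats the boundary text in both.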

In [15]:
# Perform a hybrid search with semantic reranking  
docs_and_scores = vector_store.semantic_hybrid_search_with_score(  
    query="What are some areas of judgement, assumptions, and accounting estimates in Agfa-Gevaert NV's consolidated financial statements, and which explanatory notes provide more information on them?",  
    k=3,  
)  
  
# Print the results  
for doc, score in docs_and_scores:
    print("-" * 80)
    # Semantic answers and captions may be missing or empty, so read them defensively
    answers = doc.metadata.get('answers')
    if answers:
        if answers.get('highlights'):
            print(f"Semantic Answer: {answers['highlights']}")
        else:
            print(f"Semantic Answer: {answers['text']}")
    print("Content:", doc.page_content)
    # score belongs to the retrieved document, so it is printed once here
    print(f"Score: {score}")
    captions = doc.metadata.get('captions')
    if captions:
        if captions.get('highlights'):
            print(f"Caption: {captions['highlights']}")
        else:
            print(f"Caption: {captions['text']}")
    else:
        print("Caption not available")

--------------------------------------------------------------------------------
Content: Area of judgements, assumptions and accounting estimates Explanatory notes
The discounted cash flows used for impairment testing Note 27 ‘Goodwill and intangible assets’
The useful lives of intangible assets with finite useful lives Note 27 ‘Goodwill and intangible assets’
The assessment of the adequacy of liabilities for pending  
or expected income tax audits over previous yearsNote 17 ‘Income taxes’
The recoverability of deferred tax assets Note 17 ‘Income taxes’
The actuarial assumptions used for the measurement  
of defined benefit obligationsNote 13 ‘Post-employment benefits’
Revenue recognition with regard to multiple-element arrangements Note 8 ‘Revenue’
Impairment of financial assets expected credit losses Note 22.2 ‘Expected credit losses’
5. CHANGES IN SIGNIFICANT ACCOUNTING POLICIES
Financial reporting standards applied for the first time in 2021
The consolidated statements of the Grou