# Introduction to Retrieval Augmented Generation (RAG) with LangChain

This Jupyter Notebook provides a practical guide to understanding and implementing Retrieval Augmented Generation (RAG) using LangChain. We'll explore the core functionalities of RAG, vector databases, and provide examples of retrieval in action.

## Introduction to RAG in LangChain
Retrieval Augmented Generation (RAG) is a technique that enhances the capabilities of language models by integrating external data during the generation process. It allows language models to access and utilize information that was not included in their training data, thus enabling more accurate and relevant responses, especially for domain-specific or up-to-date information. In RAG, external data is retrieved and then passed to the LLM during the generation step.

LangChain provides a comprehensive suite of tools for building RAG applications, ranging from simple to complex. This documentation section covers everything related to the retrieval step, including data fetching and processing.

![Illustrative Diagram: Data Connection Process](images/retrieve.png)
*Illustrative diagram showing the data connection process with steps: Source, Load, Transform, Embed, Store, and Retrieve.*

## Document Loaders

Document loaders in LangChain facilitate loading documents from a variety of sources. LangChain integrates with over 100 different document loaders and major providers like AirByte and Unstructured. It supports loading all types of documents (HTML, PDF, code) from diverse locations (private S3 buckets, public websites).

## Text Splitting

A crucial aspect of retrieval is fetching only the relevant parts of documents. This requires several transformation steps to prepare the documents for retrieval. LangChain includes several algorithms for splitting large documents into smaller chunks and is optimized for specific document types (code, markdown, etc).

## Text Embedding Models

Creating embeddings for documents is another vital part of the retrieval process. Embeddings capture the semantic meaning of text, enabling efficient searching for similar text pieces. LangChain offers integrations with over 25 different embedding providers and methods, from open-source to proprietary APIs. It provides a standardized interface for easy swapping between models.

## Vector Stores

With the increasing importance of embeddings, there is a need for databases that efficiently store and search these embeddings. LangChain integrates with over 50 different vector stores, ranging from open-source local databases to cloud-hosted proprietary ones. This flexibility allows for the selection of a database best suited to specific needs.

## Retrievers

After storing data in a database, the next step is retrieval. LangChain supports various retrieval algorithms, adding significant value in this area. LangChain's retrieval methods range from basic semantic search to more advanced algorithms:

- **Parent Document Retriever**: Creates multiple embeddings per parent document, enabling retrieval of smaller chunks while returning larger context.
- **Self Query Retriever**: Parses user questions to separate semantic content from metadata filters, allowing more targeted retrieval.
- **Ensemble Retriever**: Facilitates document retrieval from multiple sources or using different algorithms.

## Indexing

LangChain's Indexing API streamlines data syncing from any source into a vector store. It helps to avoid duplication, unnecessary rewriting, and recomputation of embeddings, saving time, reducing costs, and improving vector search results.


## Import necessary libraries

In [9]:
import os
import json
import pandas as pd
import numpy as np
import re
import os

from langchain.document_loaders import PyMuPDFLoader, PyPDFLoader, PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.embeddings import HuggingFaceHubEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Document Loaders

Document loaders are essential components in LangChain, enabling the loading of data from various sources into a structured format known as `Document`. A `Document` typically consists of text and its associated metadata, facilitating the processing and analysis of data from diverse sources. This introduction will cover the functionality of document loaders and their role in data processing within LangChain.

## What are Document Loaders?

Document loaders in LangChain are designed to import data from a variety of sources and convert them into `Document` objects. These objects are structured to include the text content along with relevant metadata, making them ideal for further processing in NLP pipelines, such as Retrieval Augmented Generation (RAG) or other language model applications.

### Types of Document Loaders

LangChain supports a range of document loaders, each tailored to handle specific types of data sources:

1. **Text File Loader**: For loading plain text data from `.txt` files. This loader is useful for processing documents, books, or any other textual data stored in text file format.

2. **Web Page Loader**: This loader is capable of fetching and extracting text content from web pages. It's particularly useful for scraping and processing data from the internet.

3. **YouTube Transcript Loader**: A specialized loader for extracting transcripts from YouTube videos. This allows for the processing of spoken content in video format, turning it into text for analysis.

## Using Document Loaders

Document loaders in LangChain provide a simple and efficient way to load data into memory. They typically offer the following functionalities:

### Standard Load Method

The standard `load` method is used to directly load data from the source into memory. This method is straightforward and suitable for scenarios where immediate access to all data is required.

Example usage:

```python
from langchain_community.document_loaders import TextLoader

# Load a text file
text_loader = TextLoader('path/to/textfile.txt')
documents = text_loader.load()


## Text loader

In [10]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/algebra_news.txt")
document=loader.load()

In [11]:
print(document[0].page_content)

Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.

Algebra, having received a permit and a positive accreditation recommendation from the Agency for Science and Higher Education and a final positive opinion, has formally become the first private university in Croatia to focus on technical and digital studies.

The University of Zagreb carried the mentoring program for Algebra University through its component Faculty of Electrical Engineering and Computing.

Algebra University's mission is to develop a dynamic and innovative learning environment that empowers students and attendees to be leaders in the development and applica

In [12]:
# Custom function to format and print the content of each document
def format_and_print_content(docs):
    for doc in docs:
        # Splitting the content into paragraphs
        paragraphs = doc.page_content.split('\n\n')

        # Printing each paragraph with a line space for readability
        print("-" * 50)  # Separator for each document
        for paragraph in paragraphs:
            print(paragraph)
            print("\n")  # Adding a space after each paragraph

# Format and print the content of the documents
format_and_print_content(document)

--------------------------------------------------
Algebra became University for its 25th birthday


On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023


Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.


Algebra, having received a permit and a positive accreditation recommendation from the Agency for Science and Higher Education and a final positive opinion, has formally become the first private university in Croatia to focus on technical and digital studies.


The University of Zagreb carried the mentoring program for Algebra University through its component Faculty of Electrical Engineering and Computing.


Algebra University's mission is to develop a dynamic and innovative learning environment that empowers students an

## CSV loader

In [13]:
from langchain_community.document_loaders.csv_loader import CSVLoader


loader = CSVLoader(file_path='data/mlb_teams_2012.csv')
data = loader.load()

RuntimeError: Error loading data/mlb_teams_2012.csv

## PDF loader

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/student_guide.pdf")
documents = loader.load_and_split()

In [None]:
print(documents)

[Document(page_content='T: 00385 1 2222 182\nE: student@algebra.hr\nW: www.algebra.university\nSTUDENT \nGUIDE   \n2023/2024', metadata={'source': 'data/student_guide.pdf', 'page': 0}), Document(page_content='INTRODUCTION  2\nTABLE OF CONTENTS 3\nAbout Algebra University College  4\nCampus orientation 6\nPlan of Algebra University College premises  8\nCode of conduct  10\nStudy structure  14\nAcademic year 2023/2024 calendar  15\nWhat do I need to know as a student of \nAlgebra University College?  16\nInfoeduka – the digital student service or \ndigital notice board 16\nStudent status  17\nWhat are the conditions for enrolment in a \nhigher year of study? 17\nWhat is the semester status?  18\nWhat is X-ica and its purpose? 18\nECTS Credits and Learning Outcomes 19\nWhat are ECTS credits? 19\nWhat are learning outcomes? 19\nClasses organisation and attendance  20\nWhy is it important to attend classes? 20\nWhat are additional classes? 20\nEXAMINATIONS  21\nHow to pass an exam?  21\nWha

In [None]:
print(documents[1].page_content)

INTRODUCTION  2
TABLE OF CONTENTS 3
About Algebra University College  4
Campus orientation 6
Plan of Algebra University College premises  8
Code of conduct  10
Study structure  14
Academic year 2023/2024 calendar  15
What do I need to know as a student of 
Algebra University College?  16
Infoeduka – the digital student service or 
digital notice board 16
Student status  17
What are the conditions for enrolment in a 
higher year of study? 17
What is the semester status?  18
What is X-ica and its purpose? 18
ECTS Credits and Learning Outcomes 19
What are ECTS credits? 19
What are learning outcomes? 19
Classes organisation and attendance  20
Why is it important to attend classes? 20
What are additional classes? 20
EXAMINATIONS  21
How to pass an exam?  21
What are mid-term exams and how are they 
organised?  21
Examination periods  22
How to register and cancel mid-term exams 
and exams?  22
When has a student failed an exam?  23
How many times can an exam be taken?  24
What is the additi

## Word doc loader

In [14]:
from langchain.document_loaders import Docx2txtLoader
loader = Docx2txtLoader("data/docker.docx")
documents = loader.load()

In [15]:
print(documents[0].page_content)

Install Docker Desktop on Windows

This page contains the download URL, information about system requirements, and instructions on how to install Docker Desktop for Windows.

Docker Desktop for Windows

For checksums, see Release notes

Docker Desktop terms

Commercial use of Docker Desktop in larger enterprises (more than 250 employees OR more than $10 million USD in annual revenue) requires a paid subscription.

System requirements

WSL 2 backend Hyper-V backend and Windows containers



WSL version 1.1.3.0 or later.

Windows 11 64-bit: Home or Pro version 21H2 or higher, or Enterprise or Education version 21H2 or higher.

Windows 10 64-bit:

We recommend Home or Pro 22H2 (build 19045) or higher, or Enterprise or Education 22H2 (build 19045) or higher.

Minimum required is Home or Pro 21H2 (build 19044) or higher, or Enterprise or Education 21H2 (build 19044) or higher.

Turn on the WSL 2 feature on Windows. For detailed instructions, refer to the Microsoft documentation.

The follow

## Chek/filter files in data folder 

In [16]:
# list only pdf file paths in data folder
pdf_file_paths = [os.path.join('data/', f) for f in os.listdir('data/') if f.endswith('.pdf')]
pdf_file_paths

['data/call_admission_20_21.pdf', 'data/student_guide.pdf']

In [17]:
# list only pdf files in "../data/ foler
pdf_files = os.listdir("data/")
pdf_files = [f for f in pdf_files if f.endswith(".pdf")]
print(pdf_files)


['call_admission_20_21.pdf', 'student_guide.pdf']


In [18]:
# list only docx files in "../data/ foler
doc_files = os.listdir("data/")
doc_files = [f for f in doc_files if f.endswith(".docx")]
print(doc_files)

['docker.docx']


In [19]:
# list only pptx files in "../data/ foler
ppt_files = os.listdir("data/")
ppt_files = [f for f in ppt_files if f.endswith(".pptx")]
print(ppt_files)

['ATTVSP_2023_Lecture_Multimodal_Data_Intelligence.pptx']


# 2. Text Splitters in LangChain

After loading documents, it's often necessary to transform them to better suit specific applications. A common transformation is splitting a long document into smaller chunks that fit within the context window of your model. LangChain provides a variety of built-in document transformers for splitting, combining, filtering, and manipulating documents.

## The Necessity of Text Splitting

Handling long text pieces requires splitting them into manageable chunks. While this might seem straightforward, there's considerable complexity involved. The goal is to keep semantically related text together, and what constitutes 'semantically related' can vary depending on the type of text.

### How Text Splitters Work:

1. **Split Text**: Break the text into small, semantically meaningful units (often sentences).
2. **Combine Chunks**: Aggregate these small chunks into larger ones until reaching a certain size limit (determined by a specific function).
3. **Create Overlapping Chunks**: Once the size limit is reached, the chunk becomes a separate piece of text, and a new chunk begins, with some overlap to maintain context.

### Customization Axes:

- **Text Splitting Method**: How the text is initially divided into smaller units.
- **Chunk Size Measurement**: How the size of the chunks is determined.

## Types of Text Splitters in LangChain

LangChain includes various text splitters, each with unique characteristics. Here's an overview:

| Name | Splits On | Adds Metadata | Description |
|------|-----------|---------------|-------------|
| Recursive | User-defined characters | No | Recursively splits text, aiming to keep related pieces together. Recommended for initial text splitting. |
| HTML | HTML characters | Yes | Splits based on HTML characters, adding metadata about chunk origins. |
| Markdown | Markdown characters | Yes | Splits based on Markdown characters, adding metadata about chunk origins. |
| Code | Language-specific characters | No | Splits text based on coding language syntax. Supports 15 languages. |
| Token | Tokens | No | Splits text based on tokens. Offers different methods for measuring tokens. |
| Character | User-defined character | No | Splits text based on a specified character. A simpler method. |
| [Experimental] Semantic Chunker | Sentences | No | Initially splits on sentences, then combines adjacent ones if semantically similar. Developed by Greg Kamradt. |

## Evaluating Text Splitters

To assess text splitters, you can use the Chunkviz utility by Greg Kamradt. Chunkviz visualizes text splitting and aids in fine-tuning the splitting parameters.

## Other Document Transformations

Text splitting is just one transformation type for document preparation. LangChain integrates with third-party tools for additional document transformations. Head to the Integrations section for more details on these tools.


# Recursive Character Text Splitter in LangChain

When dealing with large documents, it's often necessary to split the text into smaller, more manageable chunks. LangChain's `RecursiveCharacterTextSplitter` is an ideal tool for this task, especially for generic text. It recursively splits text based on a list of characters, with the goal of keeping semantically related pieces of text together.

## How It Works

- **How Text Is Split**: The text is split based on a list of characters. The default list is `["\n\n", "\n", " ", ""]`.
- **Chunk Size Measurement**: The size of each chunk is measured by the number of characters.

## Example: Splitting a Document

First, we'll read a long document and then use the `RecursiveCharacterTextSplitter` to split it into smaller chunks.


## RecursiveCharacterTextSplitter

In [20]:
# Read the document
with open("data/algebra_news.txt") as f:
    algebra = f.read()

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,  # Set a small chunk size for demonstration
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

# Create documents from the text
texts = text_splitter.create_documents([algebra])

for i, text in enumerate(texts):
    # Access the page_content attribute for length calculation
    chunk_length = len(text.page_content)
    print(f"{i}.Chunk has length {chunk_length} and the chunk content is: {text.page_content}")


0.Chunk has length 166 and the chunk content is: Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023
1.Chunk has length 199 and the chunk content is: Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will
2.Chunk has length 62 and the chunk content is: studies will start at the beginning of the next academic year.
3.Chunk has length 194 and the chunk content is: Algebra, having received a permit and a positive accreditation recommendation from the Agency for Science and Higher Education and a final positive opinion, has formally become the first private
4.Chunk has length 82 and the chunk content is: the first private university in Croatia to focus on technical and digital studies.
5.Chunk has length 148 and the chunk content is: 

## CharacterTextSplitter

In [21]:
# Read the document
with open("data/algebra_news.txt") as f:
    algebra = f.read()

from langchain.text_splitter import CharacterTextSplitter

# Initialize the text splitter
text_splitter = CharacterTextSplitter(
    # separator="\n\n",
    chunk_size=200,  # Set a small chunk size for demonstration
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

# Create documents from the text
texts = text_splitter.create_documents([algebra])

for i, text in enumerate(texts):
    # Access the page_content attribute for length calculation
    chunk_length = len(text.page_content)
    print(f"{i}.Chunk has length {chunk_length} and the chunk content is: {text.page_content}")


Created a chunk of size 249, which is longer than the specified 200
Created a chunk of size 259, which is longer than the specified 200
Created a chunk of size 494, which is longer than the specified 200
Created a chunk of size 340, which is longer than the specified 200
Created a chunk of size 702, which is longer than the specified 200
Created a chunk of size 205, which is longer than the specified 200
Created a chunk of size 497, which is longer than the specified 200
Created a chunk of size 511, which is longer than the specified 200
Created a chunk of size 1157, which is longer than the specified 200
Created a chunk of size 426, which is longer than the specified 200
Created a chunk of size 225, which is longer than the specified 200
Created a chunk of size 301, which is longer than the specified 200
Created a chunk of size 367, which is longer than the specified 200


0.Chunk has length 166 and the chunk content is: Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023
1.Chunk has length 249 and the chunk content is: Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.
2.Chunk has length 259 and the chunk content is: Algebra, having received a permit and a positive accreditation recommendation from the Agency for Science and Higher Education and a final positive opinion, has formally become the first private university in Croatia to focus on technical and digital studies.
3.Chunk has length 148 and the chunk content is: The University of Zagreb carried the mentoring program for Algebra University through its component Faculty of Electrical Engin

## Semantic Chunking

In [22]:
# !pip install --quiet langchain_experimental
# !pip install --upgrade --quiet  langchain sentence_transformers
# Hugging Face Inference API
# We can also access embedding models via the Hugging Face Inference API, which does not require us to install sentence_transformers and download models locally
# https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub

In [23]:
!pip install huggingface_hub



In [24]:
from langchain_community.embeddings import HuggingFaceHubEmbeddings
embeddings = HuggingFaceHubEmbeddings()
text = "This is a test document."
query_result = embeddings.embed_query(text)
query_result[:3]

[-0.04895181208848953, -0.03986206278204918, -0.021562788635492325]

In [25]:
# Read the document
with open("data/algebra_news.txt") as f:
    algebra = f.read()

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize the text splitter
text_splitter = SemanticChunker(HuggingFaceEmbeddings())

In [26]:
# Create documents from the text
documents = text_splitter.create_documents([algebra])

for i, doc in enumerate(documents):
    # Access the page_content attribute for length calculation
    chunk_length = len(doc.page_content)
    print(f"{i}.Chunk has length {chunk_length} and the chunk content is: {doc.page_content}")


0.Chunk has length 1133 and the chunk content is: Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year. Algebra, having received a permit and a positive accreditation recommendation from the Agency for Science and Higher Education and a final positive opinion, has formally become the first private university in Croatia to focus on technical and digital studies. The University of Zagreb carried the mentoring program for Algebra University through its component Faculty of Electrical Engineering and Computing. Algebra University's mission is to develop a dynamic and innovative learning environment that empowers students and attende

In [27]:
print(documents[0].page_content)

Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year. Algebra, having received a permit and a positive accreditation recommendation from the Agency for Science and Higher Education and a final positive opinion, has formally become the first private university in Croatia to focus on technical and digital studies. The University of Zagreb carried the mentoring program for Algebra University through its component Faculty of Electrical Engineering and Computing. Algebra University's mission is to develop a dynamic and innovative learning environment that empowers students and attendees to be leaders in the development and applicatio

# Tasks

### Task 1: Exploring Document Loaders
**Objective**: Familiarize with various document loaders in LangChain.

**Task**:
- Write a script to load different types of documents (e.g., plain text, pdfs) using appropriate document loaders.
- For each type, load a sample document and display its content and metadata.

### Task 2: Implementing Custom Text Splitter
**Objective**: Understand and implement custom text splitting logic.

**Task**:
- Create a text splitter that splits documents.
- Test this splitter on a sample document and analyze the output.

### Task 3: Recursive Character Text Splitting
**Objective**: Dive deeper into the RecursiveCharacterTextSplitter.

**Task**:
- Use the RecursiveCharacterTextSplitter to split a lengthy document into smaller chunks.
- Experiment with different chunk sizes and overlaps.
- Visualize the splitting impact by printing the first few chunks of the split document.

### Task 4: Text Splitting and Chunk Analysis
**Objective**: Analyze the effectiveness of different text splitters.

**Task**:
- Use at least two different types of text splitters (e.g., Semantic Chunker and Token Splitter) on the same document.
- Compare the outputs in terms of chunk coherence and relevance.
- Discuss the pros and cons of each splitter type based on your observations.

# 3. Exploring Vector Databases

Vector stores and vector databases play a pivotal role in the efficient storage and retrieval of unstructured data, particularly in the context of Retrieval Augmented Generation (RAG). In LangChain, these components are crucial for embedding and searching over data at scale. This introduction will provide an overview of vector stores their functionality, and how they can be integrated and used in LangChain.

![vectorstores](images/vectorstore.png)

**Understanding Vector Stores**

Vector stores are specialized databases designed to handle vector embeddings of data. These embeddings represent unstructured data, such as text, in a format that can be efficiently searched and compared. The typical workflow involves embedding the data and storing these vectors, followed by embedding queries at runtime to find the 'most similar' vectors in the store. This process is fundamental in applications like semantic search, where the goal is to retrieve information that is contextually relevant to a given query.

**Process Overview**
* Load Source Data: Initially, raw data is loaded, which can be any unstructured format like text documents.
* Query Vector Store: At runtime, queries are embedded and sent to the vector store.
* Retrieve 'Most Similar' Results: The vector store returns the data that is closest in embedding space to the query vector.

**Getting Started with Vector Stores in LangChain**

To effectively use vector stores in LangChain, it is advisable to have a good understanding of text embedding models, as they play a crucial role in creating the vectors stored in these databases.

**Popular Vector Store Options in LangChain**
LangChain supports various vector store options, including:

* Chroma: A local machine-based vector database.
* FAISS: Facebook AI Similarity Search, suitable for large-scale similarity search.
* Lance: Another efficient option for vector storage and retrieval.

![Vectorstores](images/vdbs.webp)

## Create vectordatabase - FAISS

In [28]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

In [30]:
# !pip install faiss-cpu

In [31]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('data/algebra_news.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = FAISS.from_documents(documents, OpenAIEmbeddings())

Created a chunk of size 702, which is longer than the specified 500
Created a chunk of size 511, which is longer than the specified 500
Created a chunk of size 1157, which is longer than the specified 500


### Similarity search - openai embeddings


In [40]:
query = "When did Algebra become university?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.


### Similarity search - openai embeddings

In [41]:
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceHubEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('data/algebra_news.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = FAISS.from_documents(documents, HuggingFaceEmbeddings())

Created a chunk of size 702, which is longer than the specified 500
Created a chunk of size 511, which is longer than the specified 500
Created a chunk of size 1157, which is longer than the specified 500


### Similarity search


In [42]:
query = "When did Algebra become university?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.


## Create vector database - Chroma


In [58]:
# !pip install chromadb

In [59]:
# Import
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_community.embeddings import HuggingFaceHubEmbeddings
from langchain_community.vectorstores import Chroma

In [60]:
# load the document and split it into chunks
loader = TextLoader("data/algebra_news.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=30)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
# embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

from langchain_community.embeddings import HuggingFaceHubEmbeddings
embedding_function = HuggingFaceHubEmbeddings()

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

Created a chunk of size 702, which is longer than the specified 500
Created a chunk of size 511, which is longer than the specified 500
Created a chunk of size 1157, which is longer than the specified 500


### Similarity search - openai embeddings


In [61]:
query = "When did Algebra become university?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.


## Basic Example: Saving Data to Disk with Chroma

Building upon the previous example, if you aim to save the processed data to disk, you can do so by initializing the Chroma client and specifying the directory for data storage.

### Steps to Save Data with Chroma:

1. **Initialize Chroma Client**: Create an instance of the Chroma client.
2. **Set Save Directory**: Pass the path of the directory where you want the data to be saved as a parameter to the Chroma client.

### Caution:
- Chroma attempts to automatically save data to disk, but this process may encounter issues if multiple in-memory clients are operating simultaneously.
- **Best Practice**: To avoid data conflicts, ensure that only one client is active per directory path at any given time. This practice helps in maintaining data integrity and prevents overwriting or data loss.

By following these steps and precautions, you can efficiently save your processed data to disk using Chroma, ensuring data persistence and reliability.


In [63]:
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="vectorstores/chroma_db")
docs = db2.similarity_search(query)
print(docs[0].page_content)

# load from disk
db3 = Chroma(persist_directory="vectorstores/chroma_db", embedding_function=embedding_function)
docs = db3.similarity_search(query)
print(docs[0].page_content)

Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.
Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.


# 4. RAG

Create Chroma db vectorstore with HuggingFace embeddings

In [64]:
# pdf file path
pdf_file_path = "data/student_guide.pdf"

In [65]:
loader = PyPDFLoader(pdf_file_path)

In [66]:
loader.load()

[Document(page_content='T: 00385 1 2222 182\nE: student@algebra.hr\nW: www.algebra.university\nSTUDENT \nGUIDE   \n2023/2024', metadata={'source': 'data/student_guide.pdf', 'page': 0}),
 Document(page_content='INTRODUCTION  2\nTABLE OF CONTENTS 3\nAbout Algebra University College  4\nCampus orientation 6\nPlan of Algebra University College premises  8\nCode of conduct  10\nStudy structure  14\nAcademic year 2023/2024 calendar  15\nWhat do I need to know as a student of \nAlgebra University College?  16\nInfoeduka – the digital student service or \ndigital notice board 16\nStudent status  17\nWhat are the conditions for enrolment in a \nhigher year of study? 17\nWhat is the semester status?  18\nWhat is X-ica and its purpose? 18\nECTS Credits and Learning Outcomes 19\nWhat are ECTS credits? 19\nWhat are learning outcomes? 19\nClasses organisation and attendance  20\nWhy is it important to attend classes? 20\nWhat are additional classes? 20\nEXAMINATIONS  21\nHow to pass an exam?  21\nWh

In [67]:
documents = loader.load()
print(documents)
print(len(documents))

[Document(page_content='T: 00385 1 2222 182\nE: student@algebra.hr\nW: www.algebra.university\nSTUDENT \nGUIDE   \n2023/2024', metadata={'source': 'data/student_guide.pdf', 'page': 0}), Document(page_content='INTRODUCTION  2\nTABLE OF CONTENTS 3\nAbout Algebra University College  4\nCampus orientation 6\nPlan of Algebra University College premises  8\nCode of conduct  10\nStudy structure  14\nAcademic year 2023/2024 calendar  15\nWhat do I need to know as a student of \nAlgebra University College?  16\nInfoeduka – the digital student service or \ndigital notice board 16\nStudent status  17\nWhat are the conditions for enrolment in a \nhigher year of study? 17\nWhat is the semester status?  18\nWhat is X-ica and its purpose? 18\nECTS Credits and Learning Outcomes 19\nWhat are ECTS credits? 19\nWhat are learning outcomes? 19\nClasses organisation and attendance  20\nWhy is it important to attend classes? 20\nWhat are additional classes? 20\nEXAMINATIONS  21\nHow to pass an exam?  21\nWha

In [69]:
print(documents[5].page_content)

The building of the University College is open from 7:30 a.m. 
to 10:00 p.m. on weekdays and from 7:30 a.m. to 9:00 p.m. on 
Saturdays. 
The campus is closed on Sundays.Pursuant to Article 28 of the Algebra University College Statute, in accordance 
with the Code of Ethics of Algebra University College and the Book of 
Regulations on Disciplinary and Material Responsibility of students, teaching 
and non-teaching staff of Algebra University College, on 9 February 2018, the 
Dean of Algebra University College introduced the
CODE OF CONDUCT FOR STUDENTS OF 
ALGEBRA UNIVERSITY COLLEGE
Students, lecturers, associates and employees of Algebra University 
College need to work together to create an optimal environment and 
conditions for the realisation of honourable academic work.
The Code is primarily based on the desire of students and teachers 
to avoid and eliminate unacceptable behaviour and to establish a 
relationship based on mutual respect.
Students’ duties are:
→ To preserve the re

### Document Splitting
Initialize Recursive Text Splitter

In [70]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1500,
            chunk_overlap=250,
            length_function = len
        )
print(text_splitter)


<langchain.text_splitter.RecursiveCharacterTextSplitter object at 0x000001FFE5268950>


In [71]:
print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

Document 1:

Algebra became University for its 25th birthday

On its 25th birthday, Algebra completed the accreditation process and became a University - Zagreb, December 21, 2023

Algebra University started three additional professional study programs in the current academic year 2023-2024, with a record number of over 2,000 enrolled students. Three new university studies will start at the beginning of the next academic year.
----------------------------------------------------------------------------------------------------
Document 2:

In addition to the new university studies, Algebra University plans to further develop its professional studies through membership in the European Association of Institutions in Higher Education - EURASHE.


Gaming and esports experts at Algebra
----------------------------------------------------------------------------------------------------
Document 3:

Algebra University's mission is to develop a dynamic and innovative learning environment that e

### Embeddings and Vectorstore

Create Chroma db vectorstore with HuggingFace embeddings

In [72]:
from langchain.vectorstores import Chroma

vectordb = Chroma.from_documents(
    documents=docs,
    embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-MiniLM-L6-v2"),
    persist_directory="qa_db",
)
vectordb.persist()


In [73]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS, Chroma

vectordb = Chroma.from_documents(
    documents=docs,
    embedding = OpenAIEmbeddings(),
    persist_directory="qa_open",
    collection_name="guide",
)
vectordb.persist()

  warn_deprecated(


In [74]:
print('\n\nCollection count is:', vectordb._collection.count())



Collection count is: 4


# Example

## PDF data

In [75]:
# load the list of pdf files in data folder and loop through each file and load it
pdf_files = os.listdir("data/")
pdf_file_paths = [os.path.join('data/', f) for f in os.listdir('data/') if f.endswith('.pdf')]
print(pdf_file_paths)

for pdf_file_path in pdf_file_paths:
    loader = PyPDFLoader(pdf_file_path)
    documents = loader.load()
    print(len(documents))

    # text splitter
    text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1800,
            chunk_overlap=180,
            length_function = len
        )
    print(text_splitter)

    docs = text_splitter.split_documents(documents)
    print(len(docs))

    pi_db = Chroma.from_documents(
        documents=docs, 
        embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"),
        # embedding = OpenAIEmbeddings(),
        persist_directory="reports",
    )
    
    pi_db.persist()
    print('\n\nCollection count is:', pi_db._collection.count())


['data/call_admission_20_21.pdf', 'data/HT_report.pdf', 'data/student_guide.pdf']
3
<langchain.text_splitter.RecursiveCharacterTextSplitter object at 0x000001FFC7169650>
7


Collection count is: 7
74
<langchain.text_splitter.RecursiveCharacterTextSplitter object at 0x000001FFE4C8C350>
277


Collection count is: 284
18
<langchain.text_splitter.RecursiveCharacterTextSplitter object at 0x000001FFE3E35590>
53


Collection count is: 337


In [121]:
## Load financial report vectorstore from disk
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-MiniLM-L6-v2")
repotdb = Chroma(
    persist_directory="reports", 
    embedding_function=embedding_function,
    )

### Retrieval

Similarity search - get the k most similar results to the query/question/message

In [122]:
def pretty_chunks(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [123]:
query = "List the financial metrics like revenue, operating profit, comprehensive income from the consolidated statement of comprehensive for the year 2022."

In [124]:
query = "Extract the list of financial metrics from the consolidated financial statement of comprehensive income for the year 2022."

In [126]:
results_ss = pi_db.similarity_search(query, k=3)
pretty_chunks(results_ss)

Document 1:

Increase (decrease) / in provisions 12 36
Increase / (decrease) in employee benefit obligations 8 7
Increase in accruals 30 12 6
Other non-cash items (2) 7
Cash generated from operations 2,771 3,117
Interest paid (40) (78)
Income tax paid (132) (177)
Net cash flows from operating activities 2,599 2,862
Investing activities
Payments for non-current assets (1,467) (1,542)
Proceeds from sale of non-current assets 126 17
Disposal of subsidiary, net of cash disposed 3 - (3)
Proceeds from financial assets at fair value through other comprehensive income 201 14
Other investment received 3 -
Payments for secured deposits 26 (68) (38)
Interest received 10 5
Net cash flows used in investing activities (1,195) (1,547)CONSOLIDATED STATEMENT OF
CASH FLOWS
For the year ended 31 December 2022
----------------------------------------------------------------------------------------------------
Document 2:

Consolidated income statement
Consolidated balance sheet
Consolidated cash flow stat

In [127]:
query = "Extract the list of financial metrics from the consolidated financial statement of comprehensive income for the year 2022."

In [129]:
results_mmr=pi_db.max_marginal_relevance_search(query,k=3, fetch_k=15)
pretty_chunks(results_mmr)

Document 1:

Increase (decrease) / in provisions 12 36
Increase / (decrease) in employee benefit obligations 8 7
Increase in accruals 30 12 6
Other non-cash items (2) 7
Cash generated from operations 2,771 3,117
Interest paid (40) (78)
Income tax paid (132) (177)
Net cash flows from operating activities 2,599 2,862
Investing activities
Payments for non-current assets (1,467) (1,542)
Proceeds from sale of non-current assets 126 17
Disposal of subsidiary, net of cash disposed 3 - (3)
Proceeds from financial assets at fair value through other comprehensive income 201 14
Other investment received 3 -
Payments for secured deposits 26 (68) (38)
Interest received 10 5
Net cash flows used in investing activities (1,195) (1,547)CONSOLIDATED STATEMENT OF
CASH FLOWS
For the year ended 31 December 2022
----------------------------------------------------------------------------------------------------
Document 2:

Consolidated income statement
Consolidated balance sheet
Consolidated cash flow stat

# Question & Answering
### Extraction through prompting

Consolidated Financial Statements represent the combined financial results and position of a parent company and its subsidiaries. This means they provide a comprehensive view of the financial health of an entire group of companies as opposed to just one company's standalone financial position. The purpose is to present, for the parent and its subsidiaries, financial information that is a reflection of the economic activities as if the group were a single economic entity


* Single economic entity view: By consolidating, users of the financial statement can see the total of what's owned and owed by the entire group.

* Avoid double counting: If a parent company has transactions with its subsidiaries, those transactions are eliminated in the consolidated statements to avoid inflating revenues, expenses, assets, or liabilities.

* Regulatory and standard setting bodies: Most jurisdictions, especially those that adhere to the International Financial Reporting Standards (IFRF) or U.S. Generally Accepted Accounting Principles (GAAP), require consolidated financial statements from parent companies for external reporting if they own more than a certain percentage of another company.

### Content of Consolidated Financial Statements:

#### Consolidated Balance Sheet (or Statement of Financial Position):

* Assets: What the group owns. This includes current assets (like cash, accounts receivable, inventories) and non-current assets (like property, plant, equipment, intangibles).
* Liabilities: What the group owes. This includes current liabilities (like accounts payable, short-term loans) and non-current liabilities (like long-term debt).
* Equity: Owner's share in the group. This includes issued capital, reserves, retained earnings, non-controlling interest (which represents the share of ownership in a subsidiary not held by the parent).
#### Consolidated Income Statement (or Statement of Comprehensive Income):

* Revenues: Incomes generated by the group.
* Expenses: Costs incurred to generate revenues.
* Profit or Loss: Difference between revenues and expenses.
* Other Comprehensive Income: Represents changes in equity during a period from transactions and other events from non-owner sources.
#### Consolidated Statement of Changes in Equity:

* Shows the movements in shareholders' equity over a period, including new share issues, dividends, and profit or loss for the period.
#### Consolidated Cash Flow Statement:

* Operating Activities: Cash flows from primary activities of the group.
* Investing Activities: Cash flows from acquiring and disposing of long-term assets.
* Financing Activities: Cash flows from activities that result in changes in the size and composition of the equity and borrowings of the entity.
* Notes to the Financial Statements:

These provide additional information about the figures in the financial statements. They can be about the accounting policies, breakdowns of certain figures, details about debt, and so forth.
To prepare consolidated financial statements, intercompany transactions (like sales from a parent to a subsidiary) are eliminated, and only external transactions are shown. Furthermore, the equity of the subsidiaries is replaced by the parent's investment in the subsidiaries and any non-controlling interests.

## HT Annual Report

## 1. Consolidated income statement key financial terms

In [114]:
# from langchain.llms import OpenAI
# llm = OpenAI(temperature=0, 
#              model_name="gpt-3.5-turbo",
#                 openai_api_key=os.environ.get("OPENAI_API_KEY"),
#                 verbose=True
#                 )

In [103]:
from langchain.llms import OpenAI
llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct", openai_api_key=os.environ.get("OPENAI_API_KEY"), verbose=True)


In [130]:

question = "Extract the list of all financial metrics found in the table the consolidated financial statement of comprehensive income for year 2021 and 2022.\
    Provide the output metrics in a list format e.g. ['metric_a', 'metric_b', etc]. Each metric should be a string element.\
    Only use financial metrics from the source documents. If you are unsure, or if a metric is not part of consolidated financial statement of comprehensive income.\
    This is mandatory, and under no circumstances should a financial metric be guessed or created or made up.Ensure your answer is entirely based on the source documents."

qa_chain = RetrievalQA.from_chain_type(llm, retriever=pi_db.as_retriever(), verbose=True, return_source_documents=True)
result_metrics = qa_chain({"query": question})
print(result_metrics["result"])



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m

['Intangible assets', 'Property, plant and equipment', 'Non-current financial assets', 'Receivables', 'Prepayments and accrued income', 'Lessee use rights to leased assets (IFRS 16)', 'Contract assets (IFRS 15)', 'Contract costs (IFRS 15)', 'Deferred tax asset', 'Total non-current assets', 'Inventories', 'Assets held for sale', 'Current financial assets', 'Contract assets (IFRS 15)', 'Contract costs (IFRS 15)', 'Cash and cash equivalents', 'Prepayments and accrued income', 'Total current assets', 'TOTAL ASSETS', 'Mobile revenue', 'Fixed revenue', 'System solutions', 'Revenue', 'Other operating income', 'Total operating revenue', 'Operating expenses', 'Material expenses', 'Employee benefits expenses', 'Other expenses', 'Work performed by the Group and capitalised', 'Net impairment losses on trade receivables and contract assets', 'Depreciation and amortization', 'EBIT']


In [131]:
financial_metrics = result_metrics["result"]
print(financial_metrics)


['Intangible assets', 'Property, plant and equipment', 'Non-current financial assets', 'Receivables', 'Prepayments and accrued income', 'Lessee use rights to leased assets (IFRS 16)', 'Contract assets (IFRS 15)', 'Contract costs (IFRS 15)', 'Deferred tax asset', 'Total non-current assets', 'Inventories', 'Assets held for sale', 'Current financial assets', 'Contract assets (IFRS 15)', 'Contract costs (IFRS 15)', 'Cash and cash equivalents', 'Prepayments and accrued income', 'Total current assets', 'TOTAL ASSETS', 'Mobile revenue', 'Fixed revenue', 'System solutions', 'Revenue', 'Other operating income', 'Total operating revenue', 'Operating expenses', 'Material expenses', 'Employee benefits expenses', 'Other expenses', 'Work performed by the Group and capitalised', 'Net impairment losses on trade receivables and contract assets', 'Depreciation and amortization', 'EBIT']


Is it 'Profit attributable to owners of the parent' the same as the 'Profit attributable to Equity holders of the Company'? 

"Profit attributable to owners of the parent" and "Profit attributable to equity holders of the company" are effectively the same in financial reporting contexts. Both terms refer to the net profit (or loss) for a given period that belongs to the equity shareholders of the primary or parent company, as opposed to minority or non-controlling interests.

When a company has subsidiaries, and it consolidates its financial statements, there can be a portion of the profit that belongs to the parent company's shareholders and another portion that belongs to the non-controlling interests (also known as minority interests). The terminology used can vary based on the specific accounting standards followed (e.g., International Financial Reporting Standards or U.S. Generally Accepted Accounting Principles) and by region or jurisdiction.

## Extract each financial position and corresponding values and metadata

In [132]:
financial_metrics

"\n['Intangible assets', 'Property, plant and equipment', 'Non-current financial assets', 'Receivables', 'Prepayments and accrued income', 'Lessee use rights to leased assets (IFRS 16)', 'Contract assets (IFRS 15)', 'Contract costs (IFRS 15)', 'Deferred tax asset', 'Total non-current assets', 'Inventories', 'Assets held for sale', 'Current financial assets', 'Contract assets (IFRS 15)', 'Contract costs (IFRS 15)', 'Cash and cash equivalents', 'Prepayments and accrued income', 'Total current assets', 'TOTAL ASSETS', 'Mobile revenue', 'Fixed revenue', 'System solutions', 'Revenue', 'Other operating income', 'Total operating revenue', 'Operating expenses', 'Material expenses', 'Employee benefits expenses', 'Other expenses', 'Work performed by the Group and capitalised', 'Net impairment losses on trade receivables and contract assets', 'Depreciation and amortization', 'EBIT']"

In [133]:
print(len(financial_metrics))

882


In [134]:
import ast

# result = {}
# result["result"] = '["Revenue", "Other operating income", "Merchandise, material and energy expenses", "Service expenses", "Employee benefits expenses", "Work performed by the Group and capitalized", "Depreciation and amortization", "Impairment of non-current assets", "Net impairment losses on trade receivables and contract assets", "Other expenses", "Operating profit", "Finance income", "Finance costs", "Finance costs – net", "Profit before income tax", "Income tax expense", "Profit for the year", "Effects of foreign exchange", "Changes in the fair value of debt instruments at fair value", "Other comprehensive income / (loss) for the year, net of tax", "Total comprehensive income for the year, net of tax"]'

# Convert string representation of list to actual list
financial_metrics = ast.literal_eval(result_metrics["result"])

# Verify
print(financial_metrics)
print(len(financial_metrics))


['Intangible assets', 'Property, plant and equipment', 'Non-current financial assets', 'Receivables', 'Prepayments and accrued income', 'Lessee use rights to leased assets (IFRS 16)', 'Contract assets (IFRS 15)', 'Contract costs (IFRS 15)', 'Deferred tax asset', 'Total non-current assets', 'Inventories', 'Assets held for sale', 'Current financial assets', 'Contract assets (IFRS 15)', 'Contract costs (IFRS 15)', 'Cash and cash equivalents', 'Prepayments and accrued income', 'Total current assets', 'TOTAL ASSETS', 'Mobile revenue', 'Fixed revenue', 'System solutions', 'Revenue', 'Other operating income', 'Total operating revenue', 'Operating expenses', 'Material expenses', 'Employee benefits expenses', 'Other expenses', 'Work performed by the Group and capitalised', 'Net impairment losses on trade receivables and contract assets', 'Depreciation and amortization', 'EBIT']
33


In [135]:
for metric in financial_metrics:
    print(metric)   

Intangible assets
Property, plant and equipment
Non-current financial assets
Receivables
Prepayments and accrued income
Lessee use rights to leased assets (IFRS 16)
Contract assets (IFRS 15)
Contract costs (IFRS 15)
Deferred tax asset
Total non-current assets
Inventories
Assets held for sale
Current financial assets
Contract assets (IFRS 15)
Contract costs (IFRS 15)
Cash and cash equivalents
Prepayments and accrued income
Total current assets
TOTAL ASSETS
Mobile revenue
Fixed revenue
System solutions
Revenue
Other operating income
Total operating revenue
Operating expenses
Material expenses
Employee benefits expenses
Other expenses
Work performed by the Group and capitalised
Net impairment losses on trade receivables and contract assets
Depreciation and amortization
EBIT


In [136]:
financial_metrics = ['Revenue', 'Other operating income', 'Merchandise, material and energy expenses', 'Service expenses', 'Employee benefits expenses', 'Work performed by the Group and capitalized', 'Depreciation and amortization', 'Impairment of non-current assets', 'Net impairment losses on trade receivables and contract assets', 'Other expenses', 'Operating profit', 'Finance income', 'Finance costs', 'Finance costs – net', 'Profit before income tax', 'Income tax expense', 'Profit for the year', 'Effects of foreign exchange', 'Changes in the fair value of debt instruments at fair value', 'Other comprehensive income / (loss) for the year, net of tax', 'Total comprehensive income for the year, net of tax', 'Profit attributable to: Equity holders of the Company', 'Profit attributable to: Non-controlling interest', 'Other comprehensive income for the year from continuing operations, net of tax']

## Extract the metrics information from income statement

In [137]:
query = "List the financial metrics from the consolidated statement of comprehensive for the year 2022. Provide the output in a json format with keys: year, metric_name, metirc_value, metric_currency, metric_description.\
    The output should be a list of dictionaries. Each dictionary should have the keys: year, metric_name, metirc_value, metric_currency.\
    The year should be a date element. The metric_value should be a float element. The metric_currency should be a string element.\
    The metric_description should contain the two sentences description of the metric based on the financial elements comprising the observed metric."

from langchain.llms import OpenAI
llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct", openai_api_key=os.environ.get("OPENAI_API_KEY"), verbose=True)

qa_chain = RetrievalQA.from_chain_type(llm, retriever=pi_db.as_retriever(), verbose=True, return_source_documents=True)
result = qa_chain({"query": query})



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


In [138]:
print(result['result'])
# print result in a pretty way
print(f"Question: {query}\n")
print(f"Answer: {result}\n")  
print(f"Source Document: {result['source_documents']}\n")




[
  {
    "year": "2022",
    "metric_name": "Revenue",
    "metric_value": 7.41,
    "metric_currency": "HRK",
    "metric_description": "The total income generated from the sale of goods and services during the year."
  },
  {
    "year": "2022",
    "metric_name": "Operating profit",
    "metric_value": 946,
    "metric_currency": "HRK",
    "metric_description": "The profit earned from the company's core operations, after deducting operating expenses."
  },
  {
    "year": "2022",
    "metric_name": "Profit before income tax",
    "metric_value": 884,
    "metric_currency": "HRK",
    "metric_description": "The total profit earned before accounting for income tax expenses."
  },
  {
    "year": "2022",
    "metric_name": "Profit for the year",
    "metric_value": 660,
    "metric_currency": "HRK",
    "metric_description": "The total profit earned after accounting for all expenses and taxes for the year."
  },
  {
    "year": "2022",
    "metric_name": "
Question: List the financ

In [139]:
# Get the number of source documents
print('\n\nNumber of source documents is:', len(result["source_documents"]))
# Store the number of source documents in a variable len_source_docs
len_source_docs = len(result["source_documents"])
# Loop over the source documents and print the source number and page content 
for i in range(len_source_docs):
    print('\n\nSource number:', i)
    print('\nSource page content is:\n\n', result["source_documents"][i].page_content)



Number of source documents is: 4


Source number: 0

Source page content is:

 Consolidated income statement
Consolidated balance sheet
Consolidated cash flow statementHT GROUP FINAN-CIAL STATE -MENTS


Source number: 1

Source page content is:

 116 11716  Property, plant and equipment
 Land and
buildings
HRK millionTelecom
plant and
machinery
HRK millionTools, 
vehicles, IT and 
office
equipment
HRK millionAssets under
construction
HRK millionTotal
HRK million
As at 1 January 2021
Cost 2,500 16,353 1,137 195 20,185
Accumulated depreciation and 
impairment losses(1,616) (11,136) (929) (4) (13,685)
Net book value 884 5,217 208 191 6,500
Year ended 31 December 2021
Opening net book value 884 5,217 208 191 6,500
Additions 4 182 37 940 1,163
Transfers 210 (214) 77             (73) -
Disposals (13)                    - - - (13)
Disposal of subsidiary (4)            (272) (3) (22) (301)
Depreciation charge (104)            (848) (95) - (1,047)
Impairment loss -                (13) - - (13

# Data source batch processing

## Word documents data

In [149]:
# load the list of pdf files in ../data folder and loop through each file and load it
from langchain.document_loaders import Docx2txtLoader
pdf_files = os.listdir("data/")
pdf_file_paths = [os.path.join('data/', f) for f in os.listdir('data/') if f.endswith('.docx')]
print(pdf_file_paths)

for pdf_file_path in pdf_file_paths:
    loader = Docx2txtLoader(pdf_file_path)
    documents = loader.load()
    print(len(documents))

    # text splitter
    text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1800,
            chunk_overlap=180,
            length_function = len
        )
    print(text_splitter)

    docs = text_splitter.split_documents(documents)
    print(len(docs))

    pi_db = Chroma.from_documents(
        documents=docs, 
        embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"),
        persist_directory="data",
    )
    
    pi_db.persist()
    print('\n\nCollection count is:', pi_db._collection.count())


['../data/Obrazac najave službenog putovanja.docx', '../data/Procedura_uvođenja_zaposlenika_u_rad_(za_mentorirane)_0.7.docx']
1
<langchain.text_splitter.RecursiveCharacterTextSplitter object at 0x000002695D28E710>
1


Collection count is: 161
1
<langchain.text_splitter.RecursiveCharacterTextSplitter object at 0x000002694ECABDF0>
10


Collection count is: 171


## PPT data

In [150]:
# load the list of pdf files in ../data folder and loop through each file and load it
from langchain.document_loaders import UnstructuredPowerPointLoader

pdf_files = os.listdir("data/")
pdf_file_paths = [os.path.join('data/', f) for f in os.listdir('data/') if f.endswith('.pptx')]
print(pdf_file_paths)

for pdf_file_path in pdf_file_paths:
    loader = UnstructuredPowerPointLoader(pdf_file_path)
    # loader = PyPDFLoader(pdf_file_path)
    documents = loader.load()
    print(len(documents))

    # text splitter
    text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1800,
            chunk_overlap=180,
            length_function = len
        )
    print(text_splitter)

    docs = text_splitter.split_documents(documents)
    print(len(docs))

    pi_db = Chroma.from_documents(
        documents=docs, 
        embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"),
        persist_directory="data",
    )
    
    pi_db.persist()
    print('\n\nCollection count is:', pi_db._collection.count())


['../data/Prezentacija rezultata ankete o benefitima i akcijski plan.pptx', '../data/Vodič za snalaženje za nove zaposlenike v02122020 (1).pptx']
1
<langchain.text_splitter.RecursiveCharacterTextSplitter object at 0x000002696030A980>
3


Collection count is: 174
1
<langchain.text_splitter.RecursiveCharacterTextSplitter object at 0x000002689647F190>
25


Collection count is: 199


In [93]:
print(data[0].page_content)

 Bird Academy 2016 – Poslovna inteligencija                         Skip to content     ServicesImplementation servicesOLAP & ReportingBusiness / Data AnalysisData IntegrationData EngineeringPlanning & Financial ConsolidationData Governance, Data Quality & Master Data ManagementData MigrationData ScienceStrategic ICT consultingEducationsSolutionsDWH ModelsPi AnalyticsPi ProfitabilityPlanning for NonprofitsConQ Content AnalyticsSynQXBRLpointPartnersTechnology PartnersSolution partnersNewsAbout usCompany ProfileOur clientsSelected projectsSocial responsibilityPrizes and awardsEU projectsContactsBlogCareers ServicesImplementation servicesOLAP & ReportingBusiness / Data AnalysisData IntegrationData EngineeringPlanning & Financial ConsolidationData Governance, Data Quality & Master Data ManagementData MigrationData ScienceStrategic ICT consultingEducationsSolutionsDWH ModelsPi AnalyticsPi ProfitabilityPlanning for NonprofitsConQ Content AnalyticsSynQXBRLpointPartnersTechnology PartnersSolut

In [None]:
from langchain.vectorstores c
pi_db = Chroma.from_documents(
    documents=docs,
    embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-MiniLM-L6-v2"),
    persist_directory="../pidata/",
)
pi_db.persist()


Create Chroma db vectorstore with OpenAI embeddings

In [11]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

pi_db = Chroma.from_documents(
    documents=docs,
    embedding = OpenAIEmbeddings(),
    persist_directory="../pidata/",
    # collection_name="pdf_data",
)
pi_db.persist()

In [12]:
print('\n\nCollection count is:', pi_db._collection.count())



Collection count is: 453


## Load financial report vectorstore from disk

In [140]:

pi_db = Chroma(
    persist_directory="reports", 
    embedding_function=embedding_function,
    )

In [141]:
print('\n\nCollection count is:', pi_db._collection.count())



Collection count is: 337


## Example Retrieval

Similarity search - get the k most similar results to the query/question/message

In [143]:
def pretty_chunks(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [144]:
results_ss = pi_db.similarity_search(query, k=3)
pretty_chunks(results_ss)

InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768

In [46]:
results_mmr=pi_db.max_marginal_relevance_search(query,k=3, fetch_k=15)
pretty_chunks(results_mmr)

Document 1:

PRAV -99/17 Pravilnik o radu  1 / 25 
 
 
  
 
 
 
 
Broj: PRAV -99/17 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
PRAVILNIK O RADU
----------------------------------------------------------------------------------------------------
Document 2:

samostalnost u njihovom određivanju.  
 
Članak 3.  
Ako je neko pravo iz radnog odnosa različito uređeno ugovorom o radu, Pravilnikom o radu, 
sporazumom sklopljenim između radničkog vijeća i Poslodavca  ili zakonom, primjenjuje se za radnika 
najpovoljnije pravo, ako Zakonom o radu ili drugim zakonom nije dru gačije određeno.  
 
Članak 4.  
Radnik je dužan:  
 radne zadatke i poslove utvrđene ugovorom o radu uredno i pravodobno izvršavati u skladu s 
naravi i vrstom rada te internim aktima Poslodavca , 
 čuvati imovinu Poslodavca  i skrbiti o njenom održavanju,  
 biti na rad nom mjestu na vrijeme,  
 u slučaju spriječenosti dolaska na posao obavijestiti o tome bez odlaganja neposrednog 
rukovoditelja,  
 čuvati ugled i poslovnu tajnu

In [145]:
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor


llm = OpenAI(temperature=0, 
             model_name="gpt-3.5-turbo-instruct",
             openai_api_key=os.environ.get("OPENAI_API_KEY"),
             verbose=True
             )
                
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=pi_db.as_retriever()
)
compressed_docs = compression_retriever.get_relevant_documents(query, k=3)
pretty_chunks(compressed_docs)

InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768

Initialize LLM model 

In [146]:
from langchain.llms import OpenAI

llm = OpenAI(temperature=0, 
             model_name="gpt-3.5-turbo-instruct",
             openai_api_key=os.environ.get("OPENAI_API_KEY"),
             verbose=True
             )

# Question & Answering


## Extract the metrics information from income statement

In [148]:
query = "Describe what is AI4VET4AI project and why is it important??"

template = '''Imaš ulogu stručnjaka za odnose s javnošću u tvrtki te marketing i promociju. 
Upoznat si s osnovnim pojmovima i znaš kako se odvija proces komuniciranja. 
U ovom slučaju odgovaraš na pitanja zaposlenika s obzirom na kontekst. Odgovore pišeš u obliku pitanja i odgovora.
Odgovore temeljiš na kontekstu i na pitanju.
Odgovori na hrvatskom jeziku.

{context}
Pitanje: {query}
Odgovor:'''


QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

from langchain.llms import OpenAI
llm = OpenAI(
    temperature=0, 
    model_name="gpt-3.5-turbo",
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    verbose=True
                )

qa_chain = RetrievalQA.from_chain_type(llm, retriever=pi_db.as_retriever(), verbose=True, return_source_documents=True)
result = qa_chain({"query": query})



[1m> Entering new RetrievalQA chain...[0m


  try:


InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768

In [108]:
result

{'query': 'Describe what is AI4VET4AI project and why is it important??',
 'result': 'The AI4VET4AI project is a project that aims to support the growth of an AI-skilled workforce in Europe, with a special emphasis on VET (Vocational Education and Training) teachers and learners. It addresses the need for trained workers in the field of AI. The project is important because it recognizes the increasing importance of AI in various industries and aims to ensure that there are enough skilled workers who can effectively utilize AI technology. By focusing on VET teachers and learners, the project aims to provide the necessary training and education to equip individuals with the skills needed to succeed in the field of AI.',
 'source_documents': [Document(page_content="team members. AI4VET4AI project\xa0addresses the need for trained workers in the field of AI, by supporting the growth of an AI-skilled workforce in Europe, with a special emphasis on VET teachers and learners.   Read More    0

In [109]:
print('\nAnswer:\n\n', result["result"])


Answer:

 The AI4VET4AI project is a project that aims to support the growth of an AI-skilled workforce in Europe, with a special emphasis on VET (Vocational Education and Training) teachers and learners. It addresses the need for trained workers in the field of AI. The project is important because it recognizes the increasing importance of AI in various industries and aims to ensure that there are enough skilled workers who can effectively utilize AI technology. By focusing on VET teachers and learners, the project aims to provide the necessary training and education to equip individuals with the skills needed to succeed in the field of AI.


In [110]:
# print the result in a more readable format
print('\nAnswer:\n\n', result["result"].replace('\n', ' '))


Answer:

 The AI4VET4AI project is a project that aims to support the growth of an AI-skilled workforce in Europe, with a special emphasis on VET (Vocational Education and Training) teachers and learners. It addresses the need for trained workers in the field of AI. The project is important because it recognizes the increasing importance of AI in various industries and aims to ensure that there are enough skilled workers who can effectively utilize AI technology. By focusing on VET teachers and learners, the project aims to provide the necessary training and education to equip individuals with the skills needed to succeed in the field of AI.


In [139]:
# Pretty print the result
from pprint import pprint
pprint(result)  


{'query': 'Describe what is AI4VET4AI project and why is it important??',
 'result': 'The AI4VET4AI project is a project that aims to support the growth '
           'of an AI-skilled workforce in Europe, with a special emphasis on '
           'VET (Vocational Education and Training) teachers and learners. It '
           'addresses the need for trained workers in the field of AI. The '
           'project is important because it recognizes the increasing '
           'importance of AI in various industries and aims to ensure that '
           'there are enough skilled workers who can effectively utilize AI '
           'technology. By focusing on VET teachers and learners, the project '
           'aims to provide the necessary training and education to equip '
           'individuals with the skills needed to succeed in the field of AI.',
 'source_documents': [Document(page_content="team members. AI4VET4AI project\xa0addresses the need for trained workers in the field of AI, by supp

In [138]:
# Print source documents
print('\nSource documents:\n\n', result["source_documents"][0].metadata)


Source documents:

 {'lastmod': '2021-04-18T08:36:09+00:00', 'loc': 'https://inteligencija.com/en/industrije/telecommunications/', 'source': 'https://inteligencija.com/en/industrije/telecommunications/'}
