<div style="padding: 35px;color:white;margin:10;font-size:200%;text-align:center;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://images.pexels.com/photos/7078619/pexels-photo-7078619.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)"><b><span style='color:black'><strong>M-Pesa finances tracker RAG(Retrieval Augmented Generation) application </strong></span></b> </div> 

## <div style="padding: 20px;color:white;margin:10;font-size:90%;text-align:left;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://w0.peakpx.com/wallpaper/957/661/HD-wallpaper-white-marble-white-stone-texture-marble-stone-background-white-stone.jpg)"><b><span style='color:black'> Problem statement</span></b> </div>

Within `Generative AI`, a fast-growing application is `Retrieval Augmented Generation` (RAG). It lets users query their own data in natural language, as if chatting with their documents: they ask questions and receive answers grounded in the document contents.

This project queries M-Pesa statements to assess spending patterns. The PDF text is split up, converted into embedding vectors, and stored in a vector store. A `simple similarity search` over the stored vectors then retrieves the passages most relevant to each question, and the response is generated from those passages.
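
The similarity search mentioned above can be illustrated with a minimal sketch. It assumes cosine similarity as the scoring function and uses hypothetical three-dimensional vectors in place of real embeddings, which have far more dimensions; the function names are illustrative and are not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical "embeddings" for three chunks of statement text
stored_chunks = [
    ("withdrawal at agent", [0.9, 0.1, 0.0]),
    ("airtime purchase", [0.1, 0.9, 0.1]),
    ("funds received from bank", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # hypothetical embedding of a question about withdrawals
results = top_k(query, stored_chunks, k=2)
```

Here `results` ranks the withdrawal chunk first because its vector points in nearly the same direction as the query vector.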

## <div style="padding: 20px;color:white;margin:10;font-size:90%;text-align:left;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://w0.peakpx.com/wallpaper/957/661/HD-wallpaper-white-marble-white-stone-texture-marble-stone-background-white-stone.jpg)"><b><span style='color:black'> Importing libraries</span></b> </div>

The libraries used include:

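* LlamaIndex (`llama-index`) for building the embeddings index and calling LLMs
* OpenAI for the embedding model that converts the document text and the queries into vectors
* pypdf for extracting the text content from the PDF
* PyMuPDF (imported as `fitz`) for decrypting the password-protected PDF
* python-dotenv for loading the API key from the environment file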

The libraries are listed in the `requirements.txt` file. Running `pip install -r requirements.txt` in the terminal installs all of them at the specified versions, together with their dependencies.
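
After installation, a quick check can confirm that the packages are importable. The sketch below assumes the import names `llama_index`, `openai`, `pypdf`, `dotenv`, and `fitz`; the helper `missing_modules` is hypothetical and only uses the standard library.

```python
import importlib.util

# Import names assumed for the packages this notebook relies on
REQUIRED_MODULES = ["llama_index", "openai", "pypdf", "dotenv", "fitz"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(missing_modules(REQUIRED_MODULES))
```

An empty list means every module was found; any name printed points to a package that still needs installing.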


In [33]:

import os
from pprint import pprint

from dotenv import dotenv_values, load_dotenv
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
# Allows printing of the response sources along with their similarity scores.
from llama_index.core.response.pprint_utils import pprint_response

# Load the variables from the .env file into the process environment.
load_dotenv()

### <b> <span style='color:#16C2D5'>|</span> PDF file decryption</b> 

Since the statement is password protected, it must first be decrypted (here with PyMuPDF, imported as `fitz`) so that its text can be extracted using the `pypdf` library. 

In [11]:
import fitz  # PyMuPDF

# Decrypt the M-Pesa statement and save an unprotected copy
decrypted_pdf_path = "Decrypted_MPESA_Statement.pdf"
doc = fitz.open("MPESA_Statement_2023-05-28_to_2024-05-28.pdf")
# authenticate() returns a falsy value when the password is wrong
if doc.authenticate(str(598683)):
    doc.save(decrypted_pdf_path)
else:
    raise ValueError("Failed to authenticate the PDF with the provided password")

After `file decryption`, the next step is to load the `OPENAI_API_KEY` from the environment file. The key is required for creating the embeddings and for answering queries against the stored vectors. 

The `python-dotenv` library offers more than one way to load environment variables. One is `load_dotenv()`, which loads them into the process environment; the other is `dotenv_values()`, which returns them as a dictionary (named `config` here) from which the values are read. Only the `OPENAI_API_KEY` is stored in the environment file. 

In [5]:
config = dotenv_values('.env')

In [6]:
os.environ['OPENAI_API_KEY'] = config.get("OPENAI_API_KEY")
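
If the key is absent from the `.env` file, `config.get` returns `None`, and assigning `None` to `os.environ` raises a `TypeError` that does not name the missing variable. A minimal sketch of a fail-fast alternative follows; the helper `export_required_key` is hypothetical and accepts any dictionary-like `config`.

```python
import os


def export_required_key(config, name, environ=None):
    """Copy config[name] into the environment, failing early if it is absent."""
    environ = os.environ if environ is None else environ
    value = config.get(name)
    if not value:
        raise KeyError(f"{name} is missing or empty in the .env file")
    environ[name] = value
    return value
```

With the `config` dictionary loaded above, this would be called as `export_required_key(config, "OPENAI_API_KEY")`.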

In [7]:
documents = SimpleDirectoryReader("data").load_data()

In [8]:
documents

[Document(id_='5feabd30-2e41-47a4-ba81-7762fed1f630', embedding=None, metadata={'page_label': '1', 'file_name': 'Decrypted_MPESA_Statement.pdf', 'file_path': 'd:\\Projects\\M-PesaFinanceTracker\\data\\Decrypted_MPESA_Statement.pdf', 'file_type': 'application/pdf', 'file_size': 1552274, 'creation_date': '2024-05-29', 'last_modified_date': '2024-05-29'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text=' \nSUMMARY  \n \nDETAILED STATEMENT  M-PESA STATEMENT\nCustomer Name: ISAAC ODHIAMBO ODERA\nMobile Number: 0702559770\nEmail Address: oderaisaack@gmail.com\nStatement Period: 28 May 2023 - 28 May 2024\nRequest Date: 28 May 2024\nTRANSACTION TYPE PAID IN PAID OUT\nSEND MONEY: 0.00 104,486.00\nRECEIVED MONEY: 65,180.87 0.00\nAGENT DEPOSIT: 800.00 0.00\nAGENT WI

Convert the documents into vectors using the `VectorStoreIndex` class from `llama-index`. 

In [9]:
index = VectorStoreIndex.from_documents(documents)
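
Before text is embedded, it is typically split into smaller chunks so that each vector covers a focused passage. The sketch below illustrates the idea with overlapping word windows. The function `chunk_words` and its sizes are hypothetical; LlamaIndex's own splitter works on tokens and sentences, so this is only an approximation of the concept.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of chunk_size words that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Summary lines taken from the statement text loaded above
sample = "SEND MONEY: 0.00 104,486.00 RECEIVED MONEY: 65,180.87 0.00 AGENT DEPOSIT: 800.00 0.00"
for chunk in chunk_words(sample):
    print(chunk)
```

The overlap keeps a few words shared between neighbouring chunks, which reduces the chance that a transaction row is cut in half at a chunk boundary.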

In [10]:
index

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x22e1b788100>

## <div style="padding: 20px;color:white;margin:10;font-size:90%;text-align:left;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://w0.peakpx.com/wallpaper/957/661/HD-wallpaper-white-marble-white-stone-texture-marble-stone-background-white-stone.jpg)"><b><span style='color:black'> Query Engine</span></b> </div>

With the PDF contents converted to vectors for similarity search, the query engine can now be created.  

In [11]:
query_engine = index.as_query_engine()
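
Conceptually, a query engine retrieves the chunks most similar to the question and hands them to the LLM together with the question. The sketch below shows that second step under stated assumptions: the function `build_prompt` and the prompt wording are hypothetical and are not LlamaIndex's actual template.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved context and the user question into one LLM prompt."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Using only this context, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, reusing summary lines from the statement
prompt = build_prompt(
    "How much was received in total?",
    ["RECEIVED MONEY: 65,180.87 0.00", "AGENT DEPOSIT: 800.00 0.00"],
)
print(prompt)
```

Because the LLM only sees the retrieved chunks, the number of chunks retrieved matters for questions that span many pages; in LlamaIndex this can be adjusted through the `similarity_top_k` argument of `as_query_engine`.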

In [35]:
response = query_engine.query("How many days did I make huge withdrawals and what was the amount withdrawn?")

In [39]:
print(response)

You made huge withdrawals on two separate occasions. The first withdrawal was for 82,582.00 and the second withdrawal was for 2,005.69.


In [38]:
# pprint_response(response, show_source=True)
# print(response)
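
To see which pages an answer was drawn from, the retrieved sources can also be inspected programmatically. The sketch below assumes the response object exposes `source_nodes`, each carrying a `score` and a `node` whose `metadata` includes `page_label`, as in the document metadata shown earlier. The helper `summarize_sources` is hypothetical, and the stand-in objects only mimic those attributes.

```python
from types import SimpleNamespace


def summarize_sources(response):
    """Return (page_label, score) pairs for each retrieved source node."""
    rows = []
    for item in response.source_nodes:
        page = item.node.metadata.get("page_label", "?")
        score = None if item.score is None else round(item.score, 3)
        rows.append((page, score))
    return rows


# Stand-in object for illustration; not a real query response
fake_response = SimpleNamespace(source_nodes=[
    SimpleNamespace(score=0.8123, node=SimpleNamespace(metadata={"page_label": "1"})),
    SimpleNamespace(score=0.7991, node=SimpleNamespace(metadata={"page_label": "4"})),
])
print(summarize_sources(fake_response))
```

Passing the real `response` in place of `fake_response` would list the statement pages behind an answer, which helps when checking a figure against the PDF.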

In [14]:
response2 = query_engine.query("Do my monthly and weekly withdrawals have a pattern?")

In [16]:
print(response2)

Yes, there seems to be a pattern in your weekly and monthly withdrawals. The transactions show a mix of withdrawals, payments to small businesses, pay bill charges, and funds received from various sources. The withdrawals and payments are consistent throughout the weeks and months, indicating a regular financial activity pattern.


In [21]:
response3 = query_engine.query("How much in total did I withdraw for airtime purchase?")

In [22]:
print(response3)

You withdrew a total of 289 KES for airtime purchase.


In [25]:
response4 = query_engine.query("How much was paid in via KCB?")

In [26]:
print(response4)

18,200.00 was paid in via KCB.


In [42]:
response5 = query_engine.query("How much money in total was sent by Michael?")

In [43]:
print(response5)

Michael sent a total of 80.00.


In [46]:
response6 = query_engine.query("generate a bar graph of my daily expenditure for the month of january")

In [48]:
print(response6)

To generate a bar graph of your daily expenditure for the month of January, you would need to extract the relevant transaction data for January from the provided detailed statements for each page. Then, calculate the total expenditure for each day in January and plot this data on a bar graph with the x-axis representing the days of January and the y-axis representing the total expenditure for each day.
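
The steps described in this response can be sketched in Python. The example assumes the transactions have already been parsed into `(date, paid_out)` pairs; the rows below are hypothetical and are not taken from the statement. Plotting relies on `matplotlib`, which is an additional assumption and is imported only when the plot function is called.

```python
from collections import defaultdict
from datetime import date


def daily_totals(transactions, year, month):
    """Sum the paid-out amounts per day for the given year and month."""
    totals = defaultdict(float)
    for day, paid_out in transactions:
        if day.year == year and day.month == month:
            totals[day] += paid_out
    return dict(sorted(totals.items()))


def plot_daily_totals(totals):
    """Draw a bar chart of daily totals (requires matplotlib)."""
    import matplotlib.pyplot as plt

    plt.bar([d.day for d in totals], list(totals.values()))
    plt.xlabel("Day of January")
    plt.ylabel("Total expenditure (KES)")
    plt.show()


# Hypothetical rows for illustration only
rows = [
    (date(2024, 1, 3), 250.0),
    (date(2024, 1, 3), 100.0),
    (date(2024, 1, 15), 1200.0),
    (date(2024, 2, 1), 500.0),
]
january = daily_totals(rows, 2024, 1)
```

Calling `plot_daily_totals(january)` would then draw one bar per day with recorded spending.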
