<div style="padding: 35px;color:white;margin:10;font-size:200%;text-align:center;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://images.pexels.com/photos/7078619/pexels-photo-7078619.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)"><b><span style='color:black'><strong>M-Pesa finances tracker RAG (Retrieval Augmented Generation) application </strong></span></b> </div> 

## <div style="padding: 20px;color:white;margin:10;font-size:90%;text-align:left;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://w0.peakpx.com/wallpaper/957/661/HD-wallpaper-white-marble-white-stone-texture-marble-stone-background-white-stone.jpg)"><b><span style='color:black'> Problem statement</span></b> </div>

In the age of `Generative AI`, one increasingly common application is `Retrieval Augmented Generation` (RAG), in which users query their own data in natural language, asking questions and getting responses as if chatting with their documents.

This project queries M-Pesa statements and assesses spending patterns with a RAG application. The application extracts the PDF text, embeds it as vectors, and stores the vectors in a vector store. A `simple similarity search` over the stored vectors then retrieves the passages most relevant to each question, which helps the model base its response on the statement itself.
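
To make the similarity search step concrete, here is a minimal sketch using cosine similarity over a tiny in-memory store. The three-dimensional vectors, the `store` list, and the helper names `cosine_similarity` and `top_k` are all hypothetical and exist only for illustration; in the notebook the embedding and the search are handled by the libraries, not by code like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (similarity, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; real embedding models return
# vectors with far more dimensions.
store = [
    ("Withdrawal at agent", [0.9, 0.1, 0.0]),
    ("Airtime purchase", [0.1, 0.9, 0.1]),
    ("Funds received", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stand-in for an embedded question about withdrawals

results = top_k(query_vec, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The text attached to the highest-scoring vectors is what gets handed to the language model as context for the answer.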

## <div style="padding: 20px;color:white;margin:10;font-size:90%;text-align:left;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://w0.peakpx.com/wallpaper/957/661/HD-wallpaper-white-marble-white-stone-texture-marble-stone-background-white-stone.jpg)"><b><span style='color:black'> Importing libraries</span></b> </div>

The libraries used include:

* LlamaIndex (`llama-index`) for building the vector index and calling LLMs
* OpenAI for the OpenAI embeddings that convert the document text and the queries to vectors
* pypdf for extracting content from the PDF

The libraries are listed in the `requirements.txt` file. Running `pip install -r requirements.txt` in the terminal installs all of them, at the versions the file specifies, together with their dependencies. 


In [17]:

import os
from pprint import pprint

import nest_asyncio
from dotenv import dotenv_values, load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Allows printing of the response source along with the similarity score.
from llama_index.core.response.pprint_utils import pprint_response
from llama_parse import LlamaParse

# Load the variables defined in .env into the process environment.
load_dotenv()


### <b> <span style='color:#16C2D5'>|</span> PDF file decryption</b> 

Since the file is password protected, it must be decrypted before its data can be extracted using the `pypdf` library. The decryption itself uses PyMuPDF (imported as `fitz`). 

In [11]:
import fitz  # PyMuPDF

# Decrypt the M-Pesa statement and save a copy for parsing
decrypted_pdf_path = "Decrypted_MPESA_Statement.pdf"
doc = fitz.open("MPESA_Statement_2023-05-28_to_2024-05-28.pdf")
# authenticate() returns a non-zero value when the password is accepted
if doc.authenticate("598683"):
    doc.save(decrypted_pdf_path)
else:
    raise ValueError("Failed to authenticate the PDF with the provided password")

After `file decryption`, the next step is to load the `OPENAI_API_KEY` from the environment file. The key is required for generating the embeddings used in the similarity search. 

The `python-dotenv` library offers more than one way to load environment variables: `load_dotenv()` loads them into the process environment, while `dotenv_values()` returns them as a dictionary (named `config` here) from which the values can be read. The environment file stores the `OPENAI_API_KEY`, along with the `LLAMA_CLOUD_API_KEY` that the following cells read for LlamaParse. 

In [5]:
config = dotenv_values('.env')

In [6]:
# Load the API keys into the process environment
os.environ['OPENAI_API_KEY'] = config.get("OPENAI_API_KEY")
os.environ['LLAMA_CLOUD_API_KEY'] = config.get("LLAMA_CLOUD_API_KEY")
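
One caveat with the cell above: `config.get(...)` returns `None` when a key is absent from the file, and assigning `None` to `os.environ` raises a `TypeError`. A minimal sketch of a guard that reports missing keys by name is shown below. The helper `export_required_keys`, its `environ` parameter, and the placeholder values are hypothetical additions for illustration, not part of the notebook or of `python-dotenv`.

```python
import os

def export_required_keys(config, required, environ=None):
    """Copy the required keys from a config mapping into an environment mapping.

    Raises KeyError naming every key that is missing or empty, so the
    failure happens here rather than later inside an API call.
    """
    target = os.environ if environ is None else environ
    missing = [name for name in required if not config.get(name)]
    if missing:
        raise KeyError(f"Missing keys in .env: {', '.join(missing)}")
    for name in required:
        target[name] = config[name]

# Stand-in for the mapping returned by dotenv_values('.env');
# the values are placeholders, not real keys.
example_config = {
    "OPENAI_API_KEY": "sk-placeholder",
    "LLAMA_CLOUD_API_KEY": "llx-placeholder",
}

# A plain dict is used as the target so the demo leaves os.environ untouched.
demo_env = {}
export_required_keys(
    example_config, ["OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY"], environ=demo_env
)
print(sorted(demo_env))
```

In the notebook, the same call without the `environ` argument would write to `os.environ` directly.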

In [7]:
parser = LlamaParse(
    result_type="markdown"  # "markdown" and "text" are available
)

In [18]:
# Apply nest_asyncio so LlamaParse can run inside the notebook's event loop
nest_asyncio.apply()
file_extractor = {".pdf": parser}

documents = SimpleDirectoryReader("./data", file_extractor=file_extractor).load_data()


Started parsing the file under job_id 2d620a2a-c1eb-476f-81da-6a57ddb47c21
..

In [35]:
# documents

Convert the documents into vectors using the `VectorStoreIndex` from `llama-index`. 

In [20]:
index = VectorStoreIndex.from_documents(documents)

In [21]:
index

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x2bfdead4460>

## <div style="padding: 20px;color:white;margin:10;font-size:90%;text-align:left;display:fill;border-radius:10px;overflow:hidden;background-image: url(https://w0.peakpx.com/wallpaper/957/661/HD-wallpaper-white-marble-white-stone-texture-marble-stone-background-white-stone.jpg)"><b><span style='color:black'> Query Engine</span></b> </div>

With the PDF content converted to vectors for similarity search, the query engine can now be created from the index.  

In [22]:
query_engine = index.as_query_engine()

In [23]:
response = query_engine.query("How many days did I make huge withdrawals and what was the amount withdrawn?")

In [24]:
print(response)

You made a huge withdrawal on July 17, 2023, where you withdrew 6,500.00 units.


In [38]:
# pprint_response(response, show_source=True)
# print(response)

In [25]:
response2 = query_engine.query("does my monthly and weekly withdrawals have a pattern?")

In [26]:
print(response2)

There is a pattern in your monthly and weekly withdrawals.


In [27]:
response3 = query_engine.query("How much in total did I withdraw for airtime purchase?")

In [28]:
print(response3)

You withdrew a total of 350.00 for airtime purchase.


In [29]:
response4 = query_engine.query("How much was paid in via KCB?")

In [30]:
print(response4)

The amount paid in via KCB was 3,700.00.


In [31]:
response5 = query_engine.query("How much money in total was sent by Michael?")

In [32]:
print(response5)

Michael sent a total of 30.00 + 100.00 = 130.00 in transactions.


In [33]:
response6 = query_engine.query("generate a bar graph of my daily expenditure for the month of january")

In [34]:
print(response6)

You can generate a bar graph of your daily expenditure for the month of January by plotting the daily withdrawal amounts from your financial statement data for that specific month.
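
The response above describes the chart rather than producing it, since the query engine returns text. As a rough sketch of the aggregation such a chart would need, the code below sums amounts per day for one month and renders a text bar chart. It assumes the transactions have already been extracted into `(date, amount)` pairs; the helpers `daily_totals` and `text_bar_chart` are hypothetical, and the sample rows are made up for illustration and are not taken from the statement.

```python
from collections import defaultdict
from datetime import date

def daily_totals(transactions, year, month):
    """Sum the amounts per day for the given month, sorted by date."""
    totals = defaultdict(float)
    for day, amount in transactions:
        if day.year == year and day.month == month:
            totals[day] += amount
    return dict(sorted(totals.items()))

def text_bar_chart(totals, unit=100.0):
    """Render one line per day, with one '#' per full `unit` spent."""
    lines = []
    for day, total in totals.items():
        bar = "#" * int(total // unit)
        lines.append(f"{day.isoformat()} | {bar} {total:,.2f}")
    return "\n".join(lines)

# Made-up sample rows (date, amount withdrawn); not from the statement.
sample = [
    (date(2024, 1, 3), 250.0),
    (date(2024, 1, 3), 150.0),
    (date(2024, 1, 10), 700.0),
    (date(2024, 2, 1), 500.0),  # outside January, so it is ignored
]

january = daily_totals(sample, 2024, 1)
print(text_bar_chart(january))
```

The per-day totals could equally be passed to a plotting library to draw an actual bar graph.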
