## Making sense of MPESA transaction statement data over multiple years

#### Goal 

 - Extract data from PDF statements generated via MPESA App
 - Categorize each transaction 

#### Resources:
- [Better Programming](https://betterprogramming.pub/convert-tables-from-pdfs-to-pandas-with-python-d74f8ac31dc2)
- [Towards Data Science](https://towardsdatascience.com/how-to-extract-tables-from-pdf-using-python-pandas-and-tabula-py-c65e43bd754)

#### To do:

##### Remaining PDFs
- [X] Extract CSV from remaining PDFs
- [X] Combine all PDF data into a single DF for further analysis

##### Further clean up
 - [X] Drop the transaction status column
 - [X] Drop the balance column

##### Extract specific transactions 

 - [X] Pay bill charges total
 - [X] Create columns for other charges

##### Prepare for labeling and prediction

 - [X] Split the date column into year, month, day, hour, minute, day of the week, and weekday vs. weekend

##### Natural Language 

 - [ ] Explore approaches and natural language models to make sense of the Details field
 - [ ] Explore whether Receipt No can be used as a key. Is there a pattern to the MPESA codes?


##### Other
 - [ ] PDF in (e.g. via email, copy to storage bucket, cloud function to convert to CSV and/or SQL db/warehouse)



In [None]:
# imports

import tabula
import pandas as pd

# For no-code exploration using Bamboo Lib
# https://docs.bamboolib.8080labs.com/documentation/how-tos/installation-and-setup/install-bamboolib
#import bamboolib as bam

### 1 - Processing a single PDF file at a time [Redundant]



In [None]:
# Define directory and file names where PDFs are located
# Can be customized 

#directory = "~/dev/pdf/"
#file = "20200101_20200630.pdf"
#file_path = directory + file

In [None]:
#Convert the first page

#list_df = tabula.read_pdf(file_path)

# Output is a list of two dataframes because the first page of an MPESA statement has a summary table (list_df[0])
# followed by a detailed table (list_df[1]).

# However, we need to convert the entire document, not just the first page.

In [None]:
#Convert the entire document

#list_df = tabula.read_pdf(file_path, pages='all')

# Output is a list of dataframes. The list is of length N + 1, where N is the number of pages in the PDF,
# because the first page of an MPESA statement has a summary table (list_df[0])
# followed by a detailed table (list_df[1]). Each subsequent page becomes a dataframe element in the list.


In [None]:
# First element
#df_summary = list_df[0]

In [None]:
# Rest of the elements are the detailed MPESA transactions 

#df_detail = pd.concat(list_df[1:], ignore_index=True)

In [None]:
# Drop last column which has no relevant data. 
# The remaining data frame now corresponds to the details of transactions

#df_detail.drop(df_detail.columns[[7]],axis = 1, inplace = True) 

In [None]:
# Clean up - replace \r with " " in cell values and remove it from the column name 'Withdraw\rn'

#df_detail.replace(to_replace=[r"\r"],value=[" "],regex=True, inplace=True)
#df_detail.rename(columns = {'Withdraw\rn':'Withdrawn'}, inplace=True)

In [None]:
#df_detail.describe()

In [None]:
# Save Files
# Define CSV file name for the converted data

#file_csv_2020H1 = "20200101_20200630.csv"
#file_csv = "mpesa_2020_2021.csv"
#file_wfch_csv = "wfch_2020_2021.csv"

# Save as CSV
#df_detail.to_csv(file_csv_2020H1)

### 2 - Processing all PDFs in a directory at once

In [None]:
# Initialize the data frame that will contain the complete set of data after conversion
df_all = pd.DataFrame()

In [None]:
# In order to list directory content we need the os package
import os

# Customize this. Assumes PDFs are not secured or password protected.
pdf_dir = "/Users/josiah/dev/experiments-with-data/mpesa_pdf_statements/files/"

# Loop through the contents of the provided directory and convert each PDF
for pdf in os.listdir(pdf_dir):
    # Filter PDFs (just checks the extension for now)
    if pdf.endswith(".pdf"):
        # Read every table in the PDF into a list of dataframes
        print("Processing", pdf)
        df_single = tabula.read_pdf(os.path.join(pdf_dir, pdf), pages='all')
        # Skip the summary table (first element) and stack the detail tables
        df_single_detail = pd.concat(df_single[1:], ignore_index=True)
        df_all = pd.concat([df_single_detail, df_all], axis=0, ignore_index=True)
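
The loop above concatenates into `df_all` on every iteration, which recopies the accumulated rows each time. A minimal sketch of an alternative, collecting the frames in a list and concatenating once, is shown here. `collect_statements` and its `read_tables` parameter are hypothetical names introduced for illustration; in this notebook `read_tables` could be `lambda p: tabula.read_pdf(p, pages="all")`.

```python
from pathlib import Path

import pandas as pd


def collect_statements(pdf_dir, read_tables):
    """Concatenate the detail tables of every PDF found in pdf_dir.

    read_tables is any callable that takes a file path and returns a list
    of DataFrames (hypothetical parameter, not part of tabula-py).
    """
    frames = []
    # sorted() makes the processing order deterministic
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        tables = read_tables(str(pdf_path))
        # Assumption carried over from the notebook: the first table is
        # the summary, everything after it is transaction detail.
        frames.extend(tables[1:])
    if not frames:
        return pd.DataFrame()
    # A single concat at the end avoids recopying rows on every iteration
    return pd.concat(frames, ignore_index=True)
```

Passing the reader in as a callable also makes the function easy to try out with a stub before pointing it at real statements.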

In [None]:
# Drop last column which has no relevant data. The remaining data frame now corresponds to the details of transactions
df_all.drop(df_all.columns[[7]],axis = 1, inplace = True) 

In [None]:
# Clean up - replace \r with " " in cell values and remove it from the column name 'Withdraw\rn'

df_all.replace(to_replace=[r"\r"],value=[" "],regex=True, inplace=True)
df_all.rename(columns = {'Withdraw\rn':'Withdrawn'}, inplace=True)
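
The cell above fixes one known header by name. As a rough sketch on invented data, the same clean-up can be applied to every column name at once, which may help if other headers also wrap across lines in the PDF. The sample values below are made up for illustration.

```python
import pandas as pd

# Toy frame imitating tabula output; values are invented for illustration
df = pd.DataFrame({
    "Receipt No": ["AAA111"],
    "Details": ["Pay Bill\rCharge"],
    "Withdraw\rn": ["-23.00"],
})

# Replace carriage returns inside cell values with a space
df = df.replace(to_replace=r"\r", value=" ", regex=True)

# Remove carriage returns from every column name, not only 'Withdraw\rn'
df.columns = df.columns.str.replace("\r", "", regex=False)
```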

In [None]:
df_all

### 3 - Wrangling: Format column data types.

In [None]:
# Drop the transaction status and balance columns (positions 3 and 6) as they are not needed

df_all.drop(df_all.columns[[3,6]],axis = 1, inplace = True) 

In [None]:
# Convert Date / Time field
# (infer_datetime_format is omitted: it is deprecated since pandas 2.0, which infers the format by default)

df_all['Completion Time'] = pd.to_datetime(df_all['Completion Time'])
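
When the timestamp layout is known, passing it explicitly can be faster and makes bad rows visible. The sketch below assumes timestamps shaped like `2020-06-30 18:21:05`; that format string is an assumption and should be adjusted to whatever the extracted statements actually contain. The sample values are invented.

```python
import pandas as pd

# Sample values are invented; the format string is an assumption
raw = pd.Series(["2020-06-30 18:21:05", "2020-07-01 09:05:00", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Count the rows that failed to parse so they can be inspected
n_bad = parsed.isna().sum()
```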

In [None]:
# helper function to convert string number value to a float
# adapted from https://pbpython.com/pandas_dtypes.html

def to_float(val):
    """
    Convert the string number value to a float
     - Remove $ if present
     - Remove commas and spaces
     - Convert to float type
    """
    # Numbers (including NaN for missing values) need no string clean-up
    if isinstance(val, (int, float)):
        return float(val)
    new_val = val.replace(',', '').replace('$', '').replace(' ', '')
    return float(new_val)
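
As an alternative to applying a Python function row by row, the same clean-up can be sketched with vectorized string methods and `pd.to_numeric`. `to_float_series` is a hypothetical name, and the sample amounts are invented.

```python
import pandas as pd


def to_float_series(series):
    """Strip thousands separators and spaces, then convert to float64."""
    cleaned = (
        series.astype("string")
        .str.replace(",", "", regex=False)
        .str.replace(" ", "", regex=False)
    )
    # errors="coerce" maps anything non-numeric to a missing value
    return pd.to_numeric(cleaned, errors="coerce").astype("float64")


# Invented sample amounts, including a missing value
amounts = pd.Series(["1,250.00", "-23.00", None, " 40 "])
converted = to_float_series(amounts)
```

Unlike the helper above, this version silently turns unparseable text into NaN, so it is worth checking how many values end up missing.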

In [None]:
# Convert 'Withdrawn' and 'Paid in' column number values from string to float

df_all['Withdrawn'] = df_all['Withdrawn'].apply(to_float)
df_all['Paid in'] = df_all['Paid in'].apply(to_float)

# Convert Receipt No to string

df_all['Receipt No'] = df_all['Receipt No'].astype('string')

In [None]:
df_all

### 4 - More Columns from date + extract various charges

In [None]:
# Note: df is another name for df_all, not a copy, so columns added to df also appear in df_all
df = df_all

In [None]:
# Create new columns for year, month, day and hour
completion = df['Completion Time'].dt
df['year'] = completion.year
df['month'] = completion.month
df['day'] = completion.day
df['hour'] = completion.hour

# Create a new column for day of the week, with Monday=0 and Sunday=6
df['dayofweek'] = completion.dayofweek

# Create a new column to distinguish weekdays and weekends
df['isweekend'] = completion.dayofweek >= 5
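
The to-do list also mentions minutes, which the cell above does not extract. A small sketch on invented timestamps shows how the minute and a readable day name could be added with the `.dt` accessor; the column names `minute` and `dayname` are hypothetical.

```python
import pandas as pd

# Invented timestamps: a Tuesday and a Saturday
df = pd.DataFrame({
    "Completion Time": pd.to_datetime(
        ["2020-06-30 18:21:05", "2020-07-04 09:05:00"]
    )
})

ts = df["Completion Time"].dt
df["minute"] = ts.minute          # minute of the hour, 0-59
df["dayname"] = ts.day_name()     # English day name
df["isweekend"] = ts.dayofweek >= 5
```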

In [None]:
# Helper function
# - Input - current df, plus the list of strings we want to convert from df rows to columns.
# - Output - df with new columns added

def df_add_new_column(df_current, column_key, column_value,
                      column_to_search, list_of_strings):
    """
    Add one new column to the provided dataframe for each string in list_of_strings
    that is matched in column_to_search.
     - input: dataframe, column_key to join on, column_value to retain, column to search,
       list of strings to search for in that column
     - output: dataframe with added columns. Number of new columns added = length of list provided
    """
    # To add: a check that column_to_search and list_of_strings are provided.

    for string in list_of_strings:
        # regex=False matches the text literally; na=False treats missing details as no match
        is_match = df_current[column_to_search].str.contains(string, regex=False, na=False)
        df_1 = df_current[~is_match].reset_index(drop=True)
        df_2 = df_current[is_match][[column_key, column_value]].reset_index(drop=True)
        df_2 = df_2.rename(columns={column_value: string})
        df_current = pd.merge(df_1, df_2, on=column_key, how="left").reset_index(drop=True)
        df_current[string] = df_current[string].fillna(0)
    return df_current
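
To make the effect of the helper concrete, here is a small worked example on invented rows. It assumes, as the helper does, that a charge row shares its Receipt No with the transaction it belongs to. The function body is a condensed restatement (using literal matching) so that the block runs on its own.

```python
import pandas as pd


def df_add_new_column(df_current, column_key, column_value,
                      column_to_search, list_of_strings):
    # Condensed restatement of the helper so the example is self-contained
    for string in list_of_strings:
        mask = df_current[column_to_search].str.contains(string, regex=False)
        df_1 = df_current[~mask].reset_index(drop=True)
        df_2 = df_current[mask][[column_key, column_value]].reset_index(drop=True)
        df_2 = df_2.rename(columns={column_value: string})
        df_current = pd.merge(df_1, df_2, on=column_key, how="left")
        df_current[string] = df_current[string].fillna(0)
    return df_current


# Invented rows: a payment and its charge share one receipt number
toy = pd.DataFrame({
    "Receipt No": ["AAA111", "AAA111", "BBB222"],
    "Details": ["Pay Bill to 12345", "Pay Bill Charge", "Airtime Purchase"],
    "Withdrawn": [-500.0, -23.0, -100.0],
})

result = df_add_new_column(toy, "Receipt No", "Withdrawn",
                           "Details", ["Pay Bill Charge"])
```

The charge row disappears from `result` and its amount moves into a new `Pay Bill Charge` column on the matching payment row; transactions without a charge get 0.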


In [None]:
column_key = "Receipt No"
column_value = "Withdrawn"
column_to_search = "Details"
list_of_strings = [
    "Pay Bill Charge",
    "Pay Merchant Charge",
    "Customer Transfer of Funds Charge",
    "Withdrawal Charge",
    "Customer Send Money To Unregistered User Charge",
]
#string = list_of_strings[0]

In [None]:
df_pretrain = df_add_new_column(df,column_key,column_value,column_to_search,list_of_strings)

In [None]:
df_pretrain.info()

In [None]:
df_pretrain.to_csv("pretrain.csv")

In [None]:
df_pretrain

### 5 - Some Analytics
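
One possible starting point, given the charge columns created in section 4, is to total the charges per year. The sketch below uses invented rows shaped like `df_pretrain`; the negative sign convention for charges is an assumption about how the statement records withdrawals.

```python
import pandas as pd

# Invented rows shaped like df_pretrain (negative amounts are an assumption)
toy = pd.DataFrame({
    "year": [2020, 2020, 2021],
    "Withdrawn": [-500.0, -100.0, -250.0],
    "Pay Bill Charge": [-23.0, 0.0, -15.0],
    "Withdrawal Charge": [0.0, -28.0, 0.0],
})

charge_cols = ["Pay Bill Charge", "Withdrawal Charge"]

# Total of each charge type per year
charges_by_year = toy.groupby("year")[charge_cols].sum()

# Overall pay bill charges across all years
total_pay_bill = toy["Pay Bill Charge"].sum()
```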

### 6 - Dig into "Details"