<a href="https://colab.research.google.com/github/axjasf/YNAB-Categorizer/blob/main/budget.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

* This project is meant to bring all my personal finance related transactions into one easy to understand view.
* Scope / Value descriptoon
    * ...
* Mechanism
    * It reads CSV files from several US and German banks and Credit Card processors and harmonizes them into one dataframe.
    * It maps fields such as descriptions into payees
        * Lookup mechanism (positive and negative lists) against a payee config JSON file
        * Fuzzy matching against pre-determined patterns
    * It categorizes each transaction or splits it into several categories
        * by payee
        * by pre-determination of a percentage split (e.g. for Walgreens that should be sufficient, given that I have categorized transactions since 2014)
        * by semi-automatic order-item review split (e.g. for Apple or Amazon transactions where these files exist and where a split between utility and subscription or grocery, household products or general shopping is of interest)
    * It works with a set of indicator field to mark aspects of interest
        * Indicator for transactions in which automatic determinations have been taken place
        * Task field to address open tasks
        * ...

# Setup

## Installation of Libraries

*   Neccessary libraries that might not be available right away in CoLab need to be installed here.





## Loading of Libraries
* Loading of neccessary libraries such as Pandas etc.

In [1]:
import json
import pandas as pd

## Define global Variables
* Create transactions structure that ultimately will hold the transactions dataframes from all bank files
* Create overall transactions dataframe

In [2]:
# Define the transactions dataframe and load the JSON configuration for the different banks
bank_transactions = {}

bank_files = {
        "Chase": "chase.csv",
        "Wells Fargo Checking": "wells_fargo_checking.csv",
        "Apple": "apple.csv",
        "Commerzbank": "commerzbank.csv"
    }

all_transactions = []

# File Conversion

* For each bank file:
    * Load file into individual df
    * Basic quality control on the individual df level
    * Transform columns into target columns
        * Add Bank ID field as well as numberical ID field
    * Add individual df to transactions df

* Special transformations for non-US banks:
    * Date conversion
    * EUR to USD conversion based on an existing file (date and exchange rate or an API call to a free service)

In [3]:
def quality_control(df):
    missing_values = df.isnull().sum()
    column_data_types = df.dtypes

    return missing_values, column_data_types

In [4]:
def adjust_field_names(df, bank=""):

    if 'Category' in df.columns:
        df = df.rename(columns={"Category" : "oldCategory"})

 #   df.insert(0, 'ArghAccount',bank)
 #   df.insert(1, 'ArghID', range(1, len(df) + 1))
    df.insert(4, 'SplitID',"")
    df.insert(0, 'Date','')
    df.insert(1, 'Payee','')
    df.insert(2, 'Category Type','')
    df.insert(3, 'Category','')
    df.insert(4, 'chkPayee','')
    df.insert(5, 'chkCategory','')
    df.insert(5, 'chkSplit','')
    df.insert(6, 'chkEURUSD','')

#    if bank == "Commerzbank":
#        df.insert("Amount (USD)")
#        df = df.rename(columns={"Booking text" : "Description"})

    return df

## Chase

In [5]:
bank = 'Chase'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank])

bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction Date'], errors='coerce')
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

bank_transactions[bank] = bank_transactions[bank].drop(columns=['Post Date', 'oldCategory', 'Type', 'Memo'])
bank_transactions[bank] = bank_transactions[bank].rename(columns={"Amount" : "Amount (USD)"})

bank_transactions[bank]

Unnamed: 0,Date,Payee,Category Type,Category,chkPayee,chkSplit,chkEURUSD,chkCategory,Transaction Date,Description,SplitID,Amount (USD)
0,2023-08-31,,,,,,,,8/31/2023,AMZN Mktp US*T35JB8Y30,,-22.80
1,2023-08-31,,,,,,,,8/31/2023,AMZN Mktp US,,21.92
2,2023-08-30,,,,,,,,8/30/2023,Amazon.com*T33826K40,,-104.24
3,2023-08-29,,,,,,,,8/29/2023,Zappos.com,,59.74
4,2023-08-29,,,,,,,,8/29/2023,AMZN Mktp US,,72.22
...,...,...,...,...,...,...,...,...,...,...,...,...
934,2023-01-01,,,,,,,,1/1/2023,PAYPAL *BAYAREACOMM,,-275.00
935,2022-12-29,,,,,,,,12/29/2022,ESSO 7-ELEVEN 37870,,-59.21
936,2022-12-31,,,,,,,,12/31/2022,Audible*2H3JI2GL3,,-12.99
937,2022-12-30,,,,,,,,12/30/2022,APPLE.COM/BILL,,-15.99


## Apple

In [6]:
bank = 'Apple'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank], bank)

bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction Date'], errors='coerce')
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

bank_transactions[bank] = bank_transactions[bank].drop(columns=['Transaction Date', 'Clearing Date', 'Merchant', 'oldCategory', 'Type', 'Purchased By'])

## Commerzbank

In [7]:
bank = 'Commerzbank'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank])


bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction date'], errors='coerce', format='%d.%m.%Y') # For Commerzbank, Day.Month.Year
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

bank_transactions[bank] = bank_transactions[bank][bank_transactions[bank]['Amount'] != 0]


bank_transactions[bank].insert(7, "Amount (USD)","")

bank_transactions[bank] = bank_transactions[bank].rename(columns={"Booking text" : "Description"})

### https://www.wsj.com/market-data/quotes/fx/EURUSD/historical-prices

exchange_rates_data = pd.read_csv('eur_usd_exchange_rates.csv')

# Convert the date columns to consistent datetime format
exchange_rates_data['Date'] = pd.to_datetime(exchange_rates_data['Date'], format='%m/%d/%Y')

# Merge on the date columns to add the exchange rate to bank_transactions[bank]
bank_transactions[bank] = bank_transactions[bank].merge(exchange_rates_data[['Date', ' Close']], on='Date', how='left')

# Convert the Amount from EUR to USD
bank_transactions[bank]['Amount (USD)'] = bank_transactions[bank]['Amount'] * bank_transactions[bank][' Close']

# Drop the ' Close' column as it's not needed anymore in bank_transactions[bank]
bank_transactions[bank].drop(' Close', axis=1, inplace=True)

#bank_transactions[bank].drop(bank_transactions[bank].columns[[16, 15, 14, 13, 12, 10, 9, 8]], axis=1, inplace=True)
bank_transactions[bank] = bank_transactions[bank].drop(columns=['Transaction date', 'Value date', 'Transaction type', 'Amount', 'Account of initiator', 'Bank code of account of initiator', 'IBAN of account of initiator'])

print(bank_transactions[bank].columns)


Index(['Date', 'Payee', 'Category Type', 'Category', 'chkPayee', 'chkSplit',
       'chkEURUSD', 'Amount (USD)', 'chkCategory', 'Description', 'SplitID',
       'Currency'],
      dtype='object')


In [8]:
all_transactions = pd.concat(bank_transactions, keys=bank_transactions.keys())
all_transactions['Account-ID'] = all_transactions.index.get_level_values(0) + "-" + all_transactions.index.get_level_values(1).astype(str)

# Harmonization

* New Payee identification
    * Match payee against in a config file for payees
* Payee transformation

In [9]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

class MerchantMatcher:
    def __init__(self, data_df):
        self.data = data_df
        self.vectorizer = self._train_vectorizer()
        self.payee_vectors = self._compute_payee_vectors()
        self.positive_list_descriptions = self._get_positive_list_descriptions()

    def _train_vectorizer(self):
        all_descriptions = [desc for descriptions in self.data['Positive List'] for desc in descriptions]
        return TfidfVectorizer().fit(all_descriptions)

    def _compute_payee_vectors(self):
        payee_vectors = {}
        for merchant, details in self.data.iterrows():
            tfidf_matrix = self.vectorizer.transform(details['Positive List'])
            avg_vector = np.asarray(tfidf_matrix.mean(axis=0))
            payee_vectors[merchant] = avg_vector
        return payee_vectors

    def _get_positive_list_descriptions(self):
        return set(desc for descriptions in self.data['Positive List'] for desc in descriptions)

    def predict_payees(self, transaction_df):
        mg_values = []
        candidates = []

        for _, row in transaction_df.iterrows():
            description = row['Description']

            if pd.isna(description) or not description.strip():
                mg_values.append(None)
                continue

            if description in self.positive_list_descriptions:
                for merchant, details in self.data.iterrows():
                    if description in details['Positive List']:
                        mg_values.append(merchant)
                        break
            else:
                description_vector = self.vectorizer.transform([description])
                similarities = {merchant: linear_kernel(description_vector, np.asarray(vector))[0][0] for merchant, vector in self.payee_vectors.items()}
                predicted_merchant = max(similarities, key=similarities.get)
                max_similarity = similarities[predicted_merchant]

                if max_similarity > self.data.loc[predicted_merchant, 'Threshold']:
                    mg_values.append(predicted_merchant)
                    candidates.append({'Payee': predicted_merchant, 'Description': description, 'Probability': max_similarity})
                else:
                    mg_values.append(None)

        transaction_df['Payee'] = mg_values
        transaction_df['chkPaye'] = 'X'
        candidates_df = pd.DataFrame(candidates)
        return transaction_df, candidates_df

# Sample Usage
data_df = pd.read_json("payee_matching.json", orient="index")  # Replace with your DataFrame loading mechanism

matcher = MerchantMatcher(data_df)
predicted_df, candidates_df = matcher.predict_payees(all_transactions)

if os.path.exists("merchant_guess.csv"): os.remove("merchant_guess.csv")
if os.path.exists("candidates.csv"): os.remove("candidates.csv")
predicted_df.to_csv("merchant_guess.csv", index=False)
candidates_df.to_csv("candidates.csv", index=False)



# Categorization

* Transactions <--> Payee mapping (1:1)
* Transactions <--> Amazon Orders mapping and splitting
* Transactions <--> Apple Orders mapping and splitting
* Transactions <--> Walgreens splitting

In [10]:
# Transactions <--> Payee mapping (1:1)

with open('payee_matching.json', 'r') as file:
    payee_data = json.load(file)

# List to hold split transactions
split_transactions = []

# Iterate over each row in the dataframe
for idx, row in all_transactions.iterrows():
    payee = row['Payee']

    # Check if payee exists in the JSON data
    if payee in payee_data:
        categories = payee_data[payee]['Categories']
#        input("Payee found")
        print(payee)
        print(categories)
#        input("Payee found end")

        # If only one category exists, update the row's category columns
        if len(categories) == 1:
            row['Category Type'] = categories[0]['Category Type']
            row['Category'] = categories[0]['Category']
            row['chkCategory'] = 'A'

        # If multiple categories exist, create split transactions
        elif len(categories) > 1:
 #           input("Split Begin")
            for idx_split, category in enumerate(categories, start=1):
                print(row)
                new_row = row.copy()
                new_row['Category Type'] = category['Category Type']
                new_row['Category'] = category['Category']
                new_row['SplitID'] = str(row['Account-ID']) + '-' + str(idx_split)
                row['chkCategory'] = 'A'
                # Update the 'Amount (USD)' based on the percentage split from the JSON
                new_row['Amount (USD)'] = row['Amount (USD)'] * category.get('Percentage', 1)
                print(new_row)

                split_transactions.append(new_row)
#            input("Split End")

# Append the split transactions to the main dataframe
all_transactions = pd.concat([all_transactions, pd.DataFrame(split_transactions)], ignore_index=False)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
chkPayee                                                             
chkSplit                                                             
chkEURUSD                                                            
chkCategory                                                          
Transaction Date                                                  NaN
Description         AMAZON.COM*R30JF0II3 A440 TERRY AVE N. AMZN.CO...
SplitID                                                              
Amount (USD)                                                    57.46
Currency                                                          NaN
Account-ID                                                  Apple-195
chkPaye                                                             X
Name: (Apple, 195), dtype: object
Date                                              2023-06-10 00:00:00
Payee                                                        

In [11]:
all_transactions

Unnamed: 0,Unnamed: 1,Date,Payee,Category Type,Category,chkPayee,chkSplit,chkEURUSD,chkCategory,Transaction Date,Description,SplitID,Amount (USD),Currency,Account-ID,chkPaye
Chase,0,2023-08-31,Amazon,,,,,,,8/31/2023,AMZN Mktp US*T35JB8Y30,,-22.80,,Chase-0,X
Chase,1,2023-08-31,Amazon,,,,,,,8/31/2023,AMZN Mktp US,,21.92,,Chase-1,X
Chase,2,2023-08-30,Amazon,,,,,,,8/30/2023,Amazon.com*T33826K40,,-104.24,,Chase-2,X
Chase,3,2023-08-29,Zappos,,,,,,,8/29/2023,Zappos.com,,59.74,,Chase-3,X
Chase,4,2023-08-29,Amazon,,,,,,,8/29/2023,AMZN Mktp US,,72.22,,Chase-4,X
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Commerzbank,63,2022-11-28,Margret Janssen,Expense,Vacations,,,,A,,MARGRET JANsEN Liebe Weihnachtsgrusse von Mama...,Commerzbank-63-2,,EUR,Commerzbank-63,X
Commerzbank,70,2022-10-07,Margret Janssen,Savings,Invest,,,,,,MARGRET JANsEN Happy Birthday lieber Axel End-...,Commerzbank-70-1,,EUR,Commerzbank-70,X
Commerzbank,70,2022-10-07,Margret Janssen,Expense,Vacations,,,,A,,MARGRET JANsEN Happy Birthday lieber Axel End-...,Commerzbank-70-2,,EUR,Commerzbank-70,X
Commerzbank,74,2022-09-28,Margret Janssen,Savings,Invest,,,,,,MARGRET JANsEN Happy Birthday lieber Max End-t...,Commerzbank-74-1,,EUR,Commerzbank-74,X


# Output



*   Export transactions dataframe into CSV



In [12]:
if os.path.exists("output.csv"): os.remove("output.csv")
all_transactions.to_csv("output.csv", index=True)