<a href="https://colab.research.google.com/github/axjasf/YNAB-Categorizer/blob/main/budget.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

* This project brings all my personal finance transactions into one easy-to-understand view.
* Scope / Value description
    * ...
* Mechanism
    * It reads CSV files from several US and German banks and credit card processors and harmonizes them into one dataframe.
    * It maps fields such as descriptions to payees
        * Lookup mechanism (positive and negative lists) against a payee config JSON file
        * Fuzzy matching against pre-determined patterns
    * It categorizes each transaction or splits it into several categories
        * by payee
        * by a predetermined percentage split (e.g. for Walgreens that should be sufficient, given that I have categorized transactions since 2014)
        * by a semi-automatic split based on order-item review (e.g. for Apple or Amazon transactions where order files exist and where a split between utility and subscription, or between grocery, household products and general shopping, is of interest)
    * It works with a set of indicator fields to mark aspects of interest
        * Indicator for transactions in which automatic determinations have taken place
        * Task field to address open tasks
        * ...

# Setup

## Installation of Libraries

*   Necessary libraries that might not be available right away in Colab need to be installed here.





## Loading of Libraries
* Loading of necessary libraries such as pandas.

In [422]:
import json
import pandas as pd

## Define global Variables
* Create transactions structure that ultimately will hold the transactions dataframes from all bank files
* Create overall transactions dataframe

In [423]:
# Per-bank dataframes (keyed by bank name) and the CSV file to read for each bank
bank_transactions = {}

bank_files = {
        "Chase": "chase.csv",
        "Wells Fargo Checking": "wells_fargo_checking.csv",
        "Apple": "apple.csv",
        "Commerzbank": "commerzbank.csv"
    }

all_transactions = []

# File Conversion

* For each bank file:
    * Load file into individual df
    * Basic quality control on the individual df level
    * Transform columns into target columns
        * Add Bank ID field as well as numerical ID field
    * Add individual df to transactions df

* Special transformations for non-US banks:
    * Date conversion
    * EUR to USD conversion based on an existing file (date and exchange rate) or an API call to a free service
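
The steps above can be sketched on a tiny in-memory file. The column names `Transaction date`, `Booking text` and `Amount` mirror the Commerzbank export used later in this notebook; the two sample rows and the `Bank` column are made up for illustration, so treat this as a sketch rather than the notebook's actual conversion.

```python
import io
import pandas as pd

# Hypothetical two-row export in the German layout (day.month.year dates).
raw_csv = io.StringIO(
    "Transaction date,Booking text,Amount\n"
    "03.02.2023,REWE Markt,-12.50\n"
    "31.13.2023,Invalid date row,-1.00\n"
)
df = pd.read_csv(raw_csv)

# An explicit format keeps 03.02.2023 from being read as March 2nd (US order).
df["Date"] = pd.to_datetime(df["Transaction date"], format="%d.%m.%Y", errors="coerce")

# Rows whose date could not be parsed become NaT and can be reviewed separately.
problematic_dates = df[df["Date"].isna()]

# Rename to the target column and tag each row with its bank (assumed column name).
df = df.rename(columns={"Booking text": "Description"})
df["Bank"] = "Commerzbank"
```

With `errors="coerce"` a malformed date does not stop the load; it only shows up in `problematic_dates`, which is why that frame is worth inspecting after each file.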

In [424]:
def quality_control(df):
    """Return the number of missing values and the dtype of each column."""
    missing_values = df.isnull().sum()
    column_data_types = df.dtypes

    return missing_values, column_data_types

In [425]:
def adjust_field_names(df, bank=""):

    if 'Category' in df.columns:
        df = df.rename(columns={"Category" : "oldCategory"})

    df.insert(4, 'SplitID',"")
    df.insert(0, 'Date','')
    df.insert(1, 'Payee','')
    df.insert(2, 'Category Type','')
    df.insert(3, 'Category','')
    df.insert(4, 'chkPayee','')
    df.insert(5, 'chkCategory','')
    df.insert(6, 'chkSplit','')
    df.insert(7, 'chkEURUSD','')

#    if bank == "Commerzbank":
#        df.insert("Amount (USD)")
#        df = df.rename(columns={"Booking text" : "Description"})

    return df

## Chase

In [426]:
bank = 'Chase'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank])

bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction Date'], errors='coerce')
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

bank_transactions[bank] = bank_transactions[bank].drop(columns=['Post Date', 'oldCategory', 'Type', 'Memo', 'Transaction Date'])
bank_transactions[bank] = bank_transactions[bank].rename(columns={"Amount" : "Amount (USD)"})


## Apple

In [427]:
bank = 'Apple'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank], bank)

bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction Date'], errors='coerce')
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

bank_transactions[bank] = bank_transactions[bank].drop(columns=['Transaction Date', 'Clearing Date', 'Merchant', 'oldCategory', 'Type', 'Purchased By'])

## Commerzbank

In [428]:
bank = 'Commerzbank'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank])


bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction date'], errors='coerce', format='%d.%m.%Y') # For Commerzbank, Day.Month.Year
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

bank_transactions[bank] = bank_transactions[bank][bank_transactions[bank]['Amount'] != 0]


bank_transactions[bank].insert(7, "Amount (USD)","")

bank_transactions[bank] = bank_transactions[bank].rename(columns={"Booking text" : "Description"})

# Source of the historical EUR/USD rates: https://www.wsj.com/market-data/quotes/fx/EURUSD/historical-prices

exchange_rates_data = pd.read_csv('eur_usd_exchange_rates.csv')

# Convert the date columns to consistent datetime format
exchange_rates_data['Date'] = pd.to_datetime(exchange_rates_data['Date'], format='%m/%d/%Y')

# Merge on the date columns to add the exchange rate to bank_transactions[bank]
bank_transactions[bank] = bank_transactions[bank].merge(exchange_rates_data[['Date', ' Close']], on='Date', how='left')

# Convert the Amount from EUR to USD
bank_transactions[bank]['Amount (USD)'] = bank_transactions[bank]['Amount'] * bank_transactions[bank][' Close']

# Drop the ' Close' column as it's not needed anymore in bank_transactions[bank]
bank_transactions[bank].drop(' Close', axis=1, inplace=True)

#bank_transactions[bank].drop(bank_transactions[bank].columns[[16, 15, 14, 13, 12, 10, 9, 8]], axis=1, inplace=True)
bank_transactions[bank] = bank_transactions[bank].drop(columns=['Transaction date', 'Value date', 'Transaction type', 'Amount', 'Account of initiator', 'Bank code of account of initiator', 'IBAN of account of initiator'])
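
The left merge in the cell above only finds a rate when the booking date itself appears in the exchange rate file. If that file lists trading days only, transactions booked on weekends or holidays end up without a USD amount. One possible way around this is `pd.merge_asof`, which picks the most recent earlier rate. The sketch below uses made-up rates; the column names `Close` and `Rate Date` and the flag value `F` for `chkEURUSD` are assumptions for illustration.

```python
import pandas as pd

transactions = pd.DataFrame({
    "Date": pd.to_datetime(["2023-03-03", "2023-03-05"]),  # a Friday and a Sunday
    "Amount": [-10.0, -20.0],
})
rates = pd.DataFrame({
    "Date": pd.to_datetime(["2023-03-02", "2023-03-03", "2023-03-06"]),
    "Close": [1.05, 1.06, 1.07],
})

# merge_asof requires both frames to be sorted by their key column.
transactions = transactions.sort_values("Date")
rates = rates.sort_values("Date").rename(columns={"Date": "Rate Date"})

# direction="backward" takes the last rate on or before the transaction date.
merged = pd.merge_asof(
    transactions, rates, left_on="Date", right_on="Rate Date", direction="backward"
)

merged["Amount (USD)"] = merged["Amount"] * merged["Close"]

# Flag rows that had to fall back to an earlier day's rate.
merged["chkEURUSD"] = (merged["Rate Date"] != merged["Date"]).map({True: "F", False: ""})
```

Keeping `Rate Date` next to the converted amount makes it easy to see afterwards which rows were converted with a rate from another day.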


In [429]:
all_transactions = pd.concat(bank_transactions, keys=bank_transactions.keys())
all_transactions['Account-ID'] = all_transactions.index.get_level_values(0) + "-" + all_transactions.index.get_level_values(1).astype(str)
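
How the `Account-ID` comes about can be seen on two small frames. Passing a dict to `pd.concat` turns the dict keys into the outer index level, and the inner level is each frame's own row index. The bank names match the notebook; the amounts are invented.

```python
import pandas as pd

frames = {
    "Chase": pd.DataFrame({"Amount (USD)": [-5.0, -7.5]}),
    "Apple": pd.DataFrame({"Amount (USD)": [-1.99]}),
}

# The dict keys become level 0 of a two-level index.
combined = pd.concat(frames)

bank_level = combined.index.get_level_values(0)
row_level = combined.index.get_level_values(1).astype(str)

# Bank name plus row position within that bank's file.
combined["Account-ID"] = bank_level + "-" + row_level
```

Because the second part is the row position within each file, an ID built this way is only stable as long as the source files keep the same rows in the same order.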

# Payees

## Payee Harmonization

* New Payee identification
    * Match payee against a config file for payees
* Payee transformation
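
The lookup can be illustrated with a single entry of the payee config. The field names `Positive List`, `Prefix Length`, `Threshold` and `Categories` are the ones the code in this notebook reads from `payee_matching.json`; the payee, descriptions and values below are invented, and `lookup_payee` is a hypothetical, simplified helper rather than the matcher defined in the next cell.

```python
import json

# Invented example entry; only the field names are taken from the notebook.
config_text = """
{
  "Walgreens": {
    "Positive List": ["WALGREENS #1234 SAN FRANCISCO CA"],
    "Prefix Length": 9,
    "Threshold": 0.6,
    "Categories": [
      {"Category Type": "Needs", "Category": "Pharmacy", "Percentage": 0.7},
      {"Category Type": "Wants", "Category": "Snacks", "Percentage": 0.3}
    ]
  }
}
"""
payee_config = json.loads(config_text)


def lookup_payee(description, config):
    """Return (payee, flag): 'A' exact match, 'P' prefix match, 'C' no match."""
    text = description.lower()
    for payee, details in config.items():
        known = [entry.lower() for entry in details["Positive List"]]
        if text in known:
            return payee, "A"
        prefix_length = details.get("Prefix Length", 50)
        if any(text.startswith(entry[:prefix_length]) for entry in known):
            return payee, "P"
    return None, "C"
```

For example, `lookup_payee("WALGREENS #9999 OAKLAND CA", payee_config)` matches on the nine-character prefix, while a description that matches nothing is left for the fuzzy candidate step.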

In [430]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

class MerchantMatcher:
    def __init__(self, data_df):
        self.data = data_df
        self.vectorizer = self._train_vectorizer()
        self.payee_vectors = self._compute_payee_vectors()
        self.positive_list_descriptions = self._get_positive_list_descriptions()

    def _match_prefix(self, description, merchant_details):
        prefix_length = merchant_details.get('Prefix Length', 50)
        short_description = description[:prefix_length].lower()
        positive_list = [desc[:prefix_length].lower() for desc in merchant_details['Positive List']]

        for payee_description in positive_list:
            if short_description.startswith(payee_description):
                return True
        return False




    def _train_vectorizer(self):
        all_descriptions = [desc.lower() for descriptions in self.data['Positive List'] for desc in descriptions]
        return TfidfVectorizer().fit(all_descriptions)

    def _compute_payee_vectors(self):
        payee_vectors = {}
        for merchant, details in self.data.iterrows():
            tfidf_matrix = self.vectorizer.transform([desc.lower() for desc in details['Positive List']])
            avg_vector = np.asarray(tfidf_matrix.mean(axis=0))
            payee_vectors[merchant] = avg_vector
        return payee_vectors

    def _get_positive_list_descriptions(self):
        return set(desc.lower() for descriptions in self.data['Positive List'] for desc in descriptions)

    def predict_payees(self, transaction_df):
        mg_values = []
        chkpayee_values = []
        candidates = []

        for _, row in transaction_df.iterrows():
            # Missing descriptions arrive as NaN (a float), which has no .lower()
            description_lower = row['Description'].lower() if isinstance(row['Description'], str) else None
            current_merchant = None
            current_chkpayee = None

            if pd.isna(description_lower) or not description_lower.strip():
                mg_values.append(None)
                chkpayee_values.append(None)
                continue

            for merchant, details in self.data.iterrows():
                if description_lower in [desc.lower() for desc in details['Positive List']]:

                    current_merchant = merchant
                    current_chkpayee = 'A'
                    break

                # Check for prefix matching
                for payee_description in details['Positive List']:
                    if len(payee_description) > details.get('Prefix Length', 50) and \
                        description_lower.startswith(payee_description[:details.get('Prefix Length', 50)].lower()):
                        current_merchant = merchant
                        current_chkpayee = 'P'
                        break

                if current_merchant:
                    break

            if not current_merchant:
                description_vector = self.vectorizer.transform([description_lower])
                similarities = {merchant: linear_kernel(description_vector, np.asarray(vector))[0][0] for merchant, vector in self.payee_vectors.items()}
                predicted_merchant = max(similarities, key=similarities.get)
                max_similarity = similarities[predicted_merchant]

                if max_similarity > self.data.loc[predicted_merchant, 'Threshold']:
                    candidates.append({'Payee': predicted_merchant, 'Description': row['Description'], 'Probability': max_similarity})


            mg_values.append(current_merchant)
            chkpayee_values.append(current_chkpayee or 'C')

        transaction_df['Payee'] = mg_values
        transaction_df['chkPayee'] = chkpayee_values
        candidates_df = pd.DataFrame(candidates)
        return transaction_df, candidates_df




# Sample Usage
data_df = pd.read_json("payee_matching.json", orient="index")  # Replace with your DataFrame loading mechanism

matcher = MerchantMatcher(data_df)
predicted_df, candidates_df = matcher.predict_payees(all_transactions)

if os.path.exists("merchant_guess.csv"): os.remove("merchant_guess.csv")
if os.path.exists("candidates.csv"): os.remove("candidates.csv")
predicted_df.to_csv("merchant_guess.csv", index=False)
candidates_df.to_csv("candidates.csv", index=False)



# Categories

* Transactions <--> Payee mapping (1:1)
* Transactions <--> Amazon Orders mapping and splitting
* Transactions <--> Apple Orders mapping and splitting
* Transactions <--> Walgreens splitting
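
For the percentage-based splits, one detail worth handling is rounding: multiplying an amount by each percentage and rounding every part separately can leave the parts a cent off the original total. The sketch below is one possible way to avoid that, assuming the percentages sum to 1; `split_amount` is a hypothetical helper and not part of the notebook.

```python
def split_amount(amount, percentages):
    """Split `amount` by `percentages` so that the parts add up to `amount`.

    Assumes the percentages sum to 1. Works in whole cents to avoid
    accumulating rounding differences.
    """
    cents = round(amount * 100)
    parts = [round(cents * share) for share in percentages[:-1]]
    # The last part absorbs whatever rounding left over.
    parts.append(cents - sum(parts))
    return [part / 100 for part in parts]


# A 70/30 split of a $10.01 purchase (negative because it is an outflow).
parts = split_amount(-10.01, [0.7, 0.3])
```

Letting the last category absorb the remainder is a simple convention; assigning it to the largest category would work just as well.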

## Direct assignment

In [431]:
# Transactions <--> Payee mapping (1:1)

with open('payee_matching.json', 'r') as file:
    payee_data = json.load(file)

# List to hold split transactions
split_transactions = []

# Iterate over each row in the dataframe
for idx, row in all_transactions.iterrows():
    payee = row['Payee']

    # Check if payee exists in the JSON data
    if payee in payee_data:
        categories = payee_data[payee]['Categories']

        # If no category exists, flag the row for review
        if len(categories) == 0:
            all_transactions.at[idx, 'chkCategory'] = 'E'

        # If only one category exists, update the row's category columns
        elif len(categories) == 1:
            all_transactions.at[idx, 'Category Type'] = categories[0]['Category Type']
            all_transactions.at[idx, 'Category'] = categories[0]['Category']
            all_transactions.at[idx, 'chkCategory'] = 'A'

        # If multiple categories exist, create split transactions
        elif len(categories) > 1:
            all_transactions.at[idx, 'Category Type'] = ''  # Empty the master row's category columns
            all_transactions.at[idx, 'Category'] = ''
            all_transactions.at[idx, 'SplitID'] = str(row['Account-ID']) + '-' + 'M'
            all_transactions.at[idx, 'chkCategory'] = 'A'

            for idx_split, category in enumerate(categories, start=1):
                new_row = row.copy()
                new_row['Category Type'] = category['Category Type']
                new_row['Category'] = category['Category']
                new_row['SplitID'] = str(row['Account-ID']) + '-' + 'S' + str(idx_split-1)
                new_row['chkCategory'] = 'A'

                # Update the 'Amount (USD)' based on the percentage split from the JSON
                new_row['Amount (USD)'] = row['Amount (USD)'] * category.get('Percentage', 1)

                split_transactions.append(new_row)

# Append the split transactions to the main dataframe
all_transactions = pd.concat([all_transactions, pd.DataFrame(split_transactions)], ignore_index=False)

In [432]:
#all_transactions

# Output

## Dataframe preparation

In [433]:
# Reorder Columns

all_transactions = all_transactions[[
    'Date',
    'Account-ID',
    'SplitID',
    'Payee',
    'Category Type',
    'Category',
    'Amount (USD)',
    'Description',
    'chkPayee',
    'chkCategory',
    'chkEURUSD']]

# Sort rows
all_transactions = all_transactions.sort_values(by=['Date', 'Account-ID', 'SplitID'], ascending=[False, True, True])

# Formatting (turns the amounts into strings, so it comes after sorting)
all_transactions['Amount (USD)'] = all_transactions['Amount (USD)'].round(2)
all_transactions['Amount (USD)'] = all_transactions['Amount (USD)'].apply(lambda x: "${:,.2f}".format(x))


## Output file generation

In [434]:
if os.path.exists("output.csv"): os.remove("output.csv")
all_transactions.to_csv("output.csv", index=False)