<a href="https://colab.research.google.com/github/axjasf/YNAB-Categorizer/blob/main/budget.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

* This project brings all my personal finance transactions into one easy-to-understand view.
* Scope / Value description
    * ...
* Mechanism
    * It reads CSV files from several US and German banks and Credit Card processors and harmonizes them into one dataframe.
    * It maps fields such as descriptions into payees
        * Lookup mechanism (positive and negative lists) against a payee config JSON file
        * Fuzzy matching against pre-determined patterns
    * It categorizes each transaction or splits it into several categories
        * by payee
        * by pre-determination of a percentage split (e.g. for Walgreens that should be sufficient, given that I have categorized transactions since 2014)
        * by semi-automatic order-item review split (e.g. for Apple or Amazon transactions where these files exist and where a split between utility and subscription or grocery, household products or general shopping is of interest)
    * It works with a set of indicator fields to mark aspects of interest
        * Indicator for transactions in which automatic determinations have taken place
        * Task field to address open tasks
        * ...

# Setup

## Installation of Libraries

*   Necessary libraries that might not be available right away in Colab need to be installed here.





## Loading of Libraries
* Loading of necessary libraries such as pandas.

In [91]:
import json
import pandas as pd

## Define global Variables
* Create transactions structure that ultimately will hold the transactions dataframes from all bank files
* Create overall transactions dataframe

In [92]:
# Per-bank transaction dataframes (keyed by bank name) and the CSV file to read for each bank
bank_transactions = {}

bank_files = {
        "Chase": "chase.csv",
        "Wells Fargo Checking": "wells_fargo_checking.csv",
        "Apple": "apple.csv",
        "Commerzbank": "commerzbank.csv"
    }

# File Conversion

* For each bank file:
    * Load file into individual df
    * Basic quality control on the individual df level
    * Transform columns into target columns
        * Add Bank ID field as well as numerical ID field
    * Add individual df to transactions df
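
The per-file steps above could be sketched as one helper. This is a minimal sketch under assumptions, not the notebook's actual implementation: `COLUMN_MAP`, `load_bank_frame`, and the sample rows are made up for illustration, and the real bank exports may use different column names.

```python
import io
import pandas as pd

# Hypothetical mapping from a bank's source columns to the shared target columns.
COLUMN_MAP = {
    "Chase": {"Transaction Date": "Date", "Description": "Description", "Amount": "Amount"},
}

def load_bank_frame(source, bank, column_map=COLUMN_MAP):
    """Load one bank CSV and reshape it into the shared target layout."""
    raw = pd.read_csv(source)
    mapping = column_map[bank]

    # Basic quality control: fail early if an expected column is absent.
    missing = [col for col in mapping if col not in raw.columns]
    if missing:
        raise KeyError(f"{bank}: missing expected columns {missing}")

    # Keep only the mapped columns and rename them to the target names.
    df = raw[list(mapping)].rename(columns=mapping)
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

    # Add the bank ID and a numerical ID per row.
    df.insert(0, "Account", bank)
    df.insert(1, "ID", range(1, len(df) + 1))
    return df

# Made-up sample rows standing in for a real export file.
sample_csv = io.StringIO(
    "Transaction Date,Description,Amount\n"
    "01/15/2023,WALGREENS #1234,-12.50\n"
    "01/16/2023,AMAZON MKTPL,-40.00\n"
)
chase_df = load_bank_frame(sample_csv, "Chase")

# Add the individual frame to the overall transactions frame.
all_transactions = pd.concat([chase_df], ignore_index=True)
```

Keeping the column mapping in a dictionary means a new bank only needs a new entry rather than a new code cell.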

* Special transformations for non-US banks:
    * Date conversion
    * EUR to USD conversion based on an existing file (date and exchange rate or an API call to a free service)
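
One thing to watch with a file-based conversion: an exact-date join leaves the rate empty for any transaction date that is missing from the rates file (for example weekends, if the file only covers trading days). A possible alternative is `pandas.merge_asof`, which picks the most recent rate on or before each date. The sketch below uses made-up sample data and an assumed `Close` column name.

```python
import pandas as pd

# Made-up sample data; the real frames would be read from CSV files.
transactions = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-06", "2023-01-07", "2023-01-09"]),  # Fri, Sat, Mon
    "Amount": [-10.0, -20.0, -30.0],  # EUR
})
rates = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-06", "2023-01-09"]),  # no rate for Saturday
    "Close": [1.06, 1.07],
})

# merge_asof requires both frames to be sorted by the key column.
transactions = transactions.sort_values("Date")
rates = rates.sort_values("Date")

# direction="backward" takes the latest rate on or before each transaction date.
converted = pd.merge_asof(transactions, rates, on="Date", direction="backward")
converted["Amount (USD)"] = (converted["Amount"] * converted["Close"]).round(2)
converted = converted.drop(columns="Close")
```

Here the Saturday transaction is converted with Friday's rate instead of ending up without a USD amount.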

In [106]:
def quality_control(df):
    missing_values = df.isnull().sum()
    column_data_types = df.dtypes

    return missing_values, column_data_types

In [107]:
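def adjust_field_names(df, bank=""):
    # Optionally prefix the original columns with "org_":
    # df.columns = ['org_' + col for col in df.columns]

    if bank == "Chase":
        # rename() returns a new dataframe, so the result must be assigned;
        # otherwise the original "Category" column collides with the insert below
        df = df.rename(columns={"Category": "oldCategory"})

    # Add the target columns in a fixed order at the front of the dataframe
    df.insert(0, 'Account', bank)
    df.insert(1, 'ID', range(1, len(df) + 1))
    df.insert(2, 'SubID', "")
    df.insert(3, 'Date', '')
    df.insert(4, 'Payee', '')
    df.insert(5, 'Category Type', '')
    df.insert(6, 'Category', '')

    if bank == "Commerzbank":
        # insert() needs a position, a name and a value; the column starts empty
        # and is filled by the EUR to USD conversion
        df.insert(7, 'Amount (USD)', float('nan'))

    return df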

In [108]:
bank = 'Chase'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank], bank)

bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction Date'], errors='coerce')
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

bank_transactions[bank]


In [None]:
bank = 'Apple'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank], bank)

bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction Date'], errors='coerce')
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

In [None]:
bank = 'Commerzbank'
bank_transactions[bank] = pd.read_csv(bank_files[bank])

bank_transactions[bank] = adjust_field_names(bank_transactions[bank], bank)


bank_transactions[bank]['Date'] = pd.to_datetime(bank_transactions[bank]['Transaction date'], errors='coerce', format='%d.%m.%Y') # For Commerzbank, Day.Month.Year
problematic_dates = bank_transactions[bank][bank_transactions[bank]['Date'].isna()]
missing_values, column_data_types = quality_control(bank_transactions[bank])

# Source for the historical EUR/USD rates: https://www.wsj.com/market-data/quotes/fx/EURUSD/historical-prices

exchange_rates_data = pd.read_csv('eur_usd_exchange_rates.csv')

# Convert the date columns to consistent datetime format
exchange_rates_data['Date'] = pd.to_datetime(exchange_rates_data['Date'], format='%m/%d/%Y')

# Merge on the date columns to add the exchange rate to bank_transactions[bank]
bank_transactions[bank] = bank_transactions[bank].merge(exchange_rates_data[['Date', ' Close']], on='Date', how='left')

# Convert the Amount from EUR to USD
bank_transactions[bank]['Amount (USD)'] = bank_transactions[bank]['Amount'] * bank_transactions[bank][' Close']

# Drop the ' Close' column as it is no longer needed in bank_transactions[bank]
bank_transactions[bank].drop(' Close', axis=1, inplace=True)

# Display the result with the new 'Amount (USD)' column
bank_transactions[bank][['Date', 'Amount', 'Amount (USD)']].head()



bank_transactions[bank]

# Harmonization

* New Payee identification
    * Match the payee against a config file for payees
* Payee transformation
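
A minimal sketch of the exact-match part of the payee lookup, using only the standard library. The config below is made up: `Positive List` and `Threshold` mirror the keys the matcher reads from `payee_matching.json`, while `Negative List`, the upper-casing of descriptions, and the `lookup_payee` helper are assumptions for illustration.

```python
import json

# Hypothetical payee config; only the key names "Positive List" and "Threshold"
# come from the notebook, everything else is illustrative.
PAYEE_CONFIG_JSON = """
{
  "Walgreens": {
    "Positive List": ["WALGREENS #1234", "WALGREENS.COM"],
    "Negative List": ["WALGREENS PHOTO REFUND"],
    "Threshold": 0.6
  },
  "Amazon": {
    "Positive List": ["AMAZON MKTPL", "AMZN DIGITAL"],
    "Negative List": [],
    "Threshold": 0.7
  }
}
"""

def lookup_payee(description, config):
    """Return the payee whose positive list contains the description exactly.

    A description on a payee's negative list never matches that payee.
    Returns None when no payee claims the description.
    """
    cleaned = description.strip().upper()
    for payee, details in config.items():
        if cleaned in details.get("Negative List", []):
            continue
        if cleaned in details.get("Positive List", []):
            return payee
    return None

payee_config = json.loads(PAYEE_CONFIG_JSON)
known = lookup_payee("amzn digital ", payee_config)      # matches Amazon after cleanup
unknown = lookup_payee("CORNER BAKERY", payee_config)    # no payee claims this one
```

Descriptions that come back as `None` are the ones that would go on to fuzzy matching.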

In [None]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

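class MerchantMatcher:
    """Guess a payee for each transaction description.

    Exact matches against a payee's positive list win; otherwise the description
    is compared with each payee's averaged TF-IDF vector and accepted only if the
    similarity exceeds that payee's 'Threshold'.
    """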
    def __init__(self, data_df):
        self.data = data_df
        self.vectorizer = self._train_vectorizer()
        self.payee_vectors = self._compute_payee_vectors()
        self.positive_list_descriptions = self._get_positive_list_descriptions()

    def _train_vectorizer(self):
        all_descriptions = [desc for descriptions in self.data['Positive List'] for desc in descriptions]
        return TfidfVectorizer().fit(all_descriptions)

    def _compute_payee_vectors(self):
        payee_vectors = {}
        for merchant, details in self.data.iterrows():
            tfidf_matrix = self.vectorizer.transform(details['Positive List'])
            avg_vector = np.asarray(tfidf_matrix.mean(axis=0))
            payee_vectors[merchant] = avg_vector
        return payee_vectors

    def _get_positive_list_descriptions(self):
        return set(desc for descriptions in self.data['Positive List'] for desc in descriptions)

    def predict_payees(self, transaction_df):
        mg_values = []
        candidates = []

        for _, row in transaction_df.iterrows():
            description = row['Description']

            if pd.isna(description) or not description.strip():
                mg_values.append(None)
                continue

            if description in self.positive_list_descriptions:
                for merchant, details in self.data.iterrows():
                    if description in details['Positive List']:
                        mg_values.append(merchant)
                        break
            else:
                description_vector = self.vectorizer.transform([description])
                similarities = {merchant: linear_kernel(description_vector, np.asarray(vector))[0][0] for merchant, vector in self.payee_vectors.items()}
                predicted_merchant = max(similarities, key=similarities.get)
                max_similarity = similarities[predicted_merchant]

                if max_similarity > self.data.loc[predicted_merchant, 'Threshold']:
                    mg_values.append(predicted_merchant)
                    candidates.append({'Payee': predicted_merchant, 'Description': description, 'Probability': max_similarity})
                else:
                    mg_values.append(None)

        transaction_df['Merchant'] = mg_values
        candidates_df = pd.DataFrame(candidates)
        return transaction_df, candidates_df

# Sample Usage
data_df = pd.read_json("payee_matching.json", orient="index")  # Replace with your DataFrame loading mechanism
transactions_df = bank_transactions['Chase']  # Replace with your transaction DataFrame

matcher = MerchantMatcher(data_df)
predicted_df, candidates_df = matcher.predict_payees(transactions_df)

# to_csv overwrites existing files by default, so no explicit removal is needed
predicted_df.to_csv("merchant_guess.csv", index=False)
candidates_df.to_csv("candidates.csv", index=False)



# Categorization

* Transactions <--> Payee mapping (1:1)
* Transactions <--> Amazon Orders mapping and splitting
* Transactions <--> Apple Orders mapping and splitting
* Transactions <--> Walgreens splitting
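
The percentage split (the Walgreens case) could look roughly like this. It is a sketch under assumptions: `split_by_percentage`, the category names, and the percentages are made up, and the real shares would come from past categorizations. `Decimal` is used so the parts always add up to the original amount to the cent.

```python
from decimal import Decimal, ROUND_HALF_UP

def split_by_percentage(amount, shares):
    """Split an amount into categories by percentage, keeping the total exact.

    `shares` maps category name to a percentage; the percentages must sum to 100.
    Any rounding difference is assigned to the last category.
    """
    if sum(shares.values()) != 100:
        raise ValueError("percentages must sum to 100")

    total = Decimal(str(amount))
    cent = Decimal("0.01")
    parts = {}
    allocated = Decimal("0")
    categories = list(shares)

    # Round every part except the last to whole cents.
    for category in categories[:-1]:
        part = (total * Decimal(str(shares[category])) / 100).quantize(cent, rounding=ROUND_HALF_UP)
        parts[category] = part
        allocated += part

    # The last category takes whatever is left, so the parts sum to the total.
    parts[categories[-1]] = total - allocated
    return parts

# Made-up split for illustration only.
walgreens_split = split_by_percentage(-25.01, {"Pharmacy": 50, "Household": 30, "Groceries": 20})
```

Each resulting part could then become its own row, distinguished by the `SubID` field.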

# Output



*   Export transactions dataframe into CSV
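
A minimal sketch of the export step, assuming the per-bank frames share the same columns by this point. The frames below are stand-ins for the values of `bank_transactions`, and the output goes to an in-memory buffer so the example does not touch the file system.

```python
import io
import pandas as pd

# Stand-in frames; in the notebook these would be the values of bank_transactions.
bank_frames = {
    "Chase": pd.DataFrame({"Account": ["Chase"], "ID": [1], "Amount": [-12.5]}),
    "Apple": pd.DataFrame({"Account": ["Apple"], "ID": [1], "Amount": [-3.99]}),
}

# Stack the per-bank frames into one table with a fresh row index.
all_transactions = pd.concat(list(bank_frames.values()), ignore_index=True)

# Write to a buffer here; pass a file path instead to write a CSV file.
buffer = io.StringIO()
all_transactions.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

`index=False` keeps the pandas row index out of the file, which matches how the other CSV exports in this notebook are written.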

