### Plan for `replace_tokens(description)`

This function takes a raw transaction description and returns a cleaned version with general class tokens. These replacements help reduce sparsity and improve classification accuracy.

#### Step-by-step plan:

1. **Normalize the text**
   - Convert to lowercase to handle case-insensitive matching

2. **Replace known merchant names**
   - Use lookup lists for:
     - `[Grocer]`: Woolworths, PNP, Spar, Checkers, Superspar
     - `[Restaurant]`: KFC, Steers, Uber Eats, Fournos, Marble Pantry, Zapper
     - `[FuelBrand]`: Engen, Sasol, Shell

3. **Replace location/branch names**
   - Common names seen across examples: Craighall, Parkhurst, Castle Gate, etc.
   - Replace with `[Location]` if they appear

4. **Replace embedded dates**
   - Detect date patterns like `"* 14 Oct"` or `"* 21 Sep"`
   - Replace with `[Date]` using a regex pattern

5. **Return the modified description**


In [82]:
import pandas as pd

df = pd.read_csv("../data/transactions.csv")
df.head(10)

Unnamed: 0,description,amount,label,balance,date
0,Fusion Interest Rebate,22.85,Interest Received,15737.35,2023-10-10 00:00:00
1,Rtc Express Pmt To Battlezone Party Lance 3 Sep,-650.0,Entertainment,44479.06,2023-08-31 00:00:00
2,POS PURCHASE (EFFEC ) DJS BILTONG (PTY) LTD NO...,-90.00,Fast Food & Takeouts,38 624.93,2023-10-27T00:00:00
3,POS Purchase Engen Mitchell Park * 07 Oct,-377.17,Fuel,59485.18,2023-10-10 00:00:00
4,POS Purchase Woolworths Online * 23 Aug,-1531.85,Groceries,72873.35,2023-08-25 00:00:00
5,POS Purchase Engen Nkandla * 05 Oct,-27.85,General Purchases,74154.54,2023-10-07 00:00:00
6,SMS Notification Fee Branch:,470010 -0.80,Bank Charges and Fees,84.08,2021-10-03T00:00:00
7,Banking App Payment,-200.00,Bank Transfer,111.86,2021-09-28T00:00:00
8,Eft Debit Order Payment (): Cartrack,-199.00,Transport,2 887.60,2021-10-25T00:00:00
9,Magtape Debit MTN Sp,-1797.99,Cellular Data Purchase,6809.59,2023-08-31 00:00:00


In [83]:
raw_description = df['description'].iloc[0]
print(f"raw description : {raw_description}")
normalized_description = raw_description.strip().lower()
print(f"normalized description : {normalized_description}")

raw description : Fusion Interest Rebate
normalized description : fusion interest rebate


### Step 1: Normalize Descriptions

I started by converting all descriptions to lowercase and removing leading/trailing spaces. This ensures consistent matching for vendor names and patterns, as the dataset contained varied casing (e.g., "Woolworths" vs "woolworths") and formatting inconsistencies.


In [84]:
df['description'] = df['description'].str.strip().str.lower()
df.head(5)

Unnamed: 0,description,amount,label,balance,date
0,fusion interest rebate,22.85,Interest Received,15737.35,2023-10-10 00:00:00
1,rtc express pmt to battlezone party lance 3 sep,-650.0,Entertainment,44479.06,2023-08-31 00:00:00
2,pos purchase (effec ) djs biltong (pty) ltd no...,-90.0,Fast Food & Takeouts,38 624.93,2023-10-27T00:00:00
3,pos purchase engen mitchell park * 07 oct,-377.17,Fuel,59485.18,2023-10-10 00:00:00
4,pos purchase woolworths online * 23 aug,-1531.85,Groceries,72873.35,2023-08-25 00:00:00


### Step 2: Replacing Known Vendor Names

Using EDA, I created vendor lists for Grocers, Restaurants, and Fuel stations. These were based on recurring merchant names in the `description` field. I used regular expressions to replace them with class tokens:
- `[grocers]`
- `[restaurant]`
- `[garage]`

This reduces token noise and helps generalize the model input.


In [85]:
# Known vendor name lists for replacement

GROCERS = [
    "woolworths", "checkers", "pnp", "spar", "superspar"
]

RESTAURANTS = [
    "kfc", "steers", "uber eats", "marble pantry", "zapper", "fournos", "nice on 4th"
]

GARAGE = [
    "engen", "shell", "sasol"
]


In [86]:
example_desc = df[df['label'] == 'Groceries']['description'].iloc[0]
example_desc

'pos purchase woolworths online  * 23 aug'

In [87]:
for grocers in GROCERS:
    if grocers in example_desc:
        example_desc = example_desc.replace(grocers , '[grocers]' )
example_desc

'pos purchase [grocers] online  * 23 aug'

### Replacing Known Vendor Names Using Regex

To robustly replace known vendor names in transaction descriptions, I used a regular expression pattern built from a list of vendor keywords.

- `\b` word boundaries ensure we match whole words only
- `re.escape()` handles vendor names safely
- `str.replace(..., regex=True)` applies the replacements

This reduces the chance of false matches and standardizes rare vendor mentions into common class tokens like `[grocers]`.


In [88]:
import re

# \b ensures whole-word matching
pattern = '|'.join([fr'\b{re.escape(g)}\b' for g in GROCERS])
df['description'] = df['description'].str.replace(pattern, '[grocers]', regex=True)
df[df['label'] == 'Groceries']['description'].head(10)


4               pos purchase [grocers] online  * 23 aug
23          pos purchase [grocers] ballito ju  * 15 oct
24     pos purchase [grocers] fam craighall p  * 24 aug
26    pos purchase (effec ) [grocers] fam elgin kemp...
27    pos purchase (effec ) [grocers] fam elgin kemp...
28                     pos purchase [grocers]  * 12 oct
40           pos purchase lekker biltong ball  * 18 aug
53            pos purchase [grocers] tiffanys  * 22 sep
66                       pos purchase pnpasap  * 13 oct
84                     pos purchase [grocers]  * 31 aug
Name: description, dtype: object

“This  lists are not exhaustive. Only common or high-frequency vendor names are replaced to preserve data quality without overfitting to one-off cases.”

### Step 3: Replacing Embedded Dates

Most `description` fields ended with a date in the format `* DD MMM`, like `* 14 Oct`. These date values added unnecessary variance and were replaced using a regex pattern with `[date]`. This transformation further reduced the token space while preserving temporal structure.


In [89]:


def replace_vendors(text):
    # Replace grocers
    grocer_pattern = '|'.join([fr'\b{re.escape(g)}\b' for g in GROCERS])
    text = re.sub(grocer_pattern, '[grocers]', text)
    
    # Replace restaurants
    restaurant_pattern = '|'.join([fr'\b{re.escape(r)}\b' for r in RESTAURANTS])
    text = re.sub(restaurant_pattern, '[restaurant]', text)

    # Replace garage/fuel brands
    fuel_pattern = '|'.join([fr'\b{re.escape(f)}\b' for f in GARAGE])
    text = re.sub(fuel_pattern, '[garage]', text)

    # Replace in-description dates like "* 23 aug"
    date_pattern = r"\*\s*\d{1,2}\s+[a-zA-Z]{3}"
    text = re.sub(date_pattern, '[date]', text)

    return text


# Apply to column
df['description'] = df['description'].apply(replace_vendors)

# Preview
df[['description', 'label']].head(20)


Unnamed: 0,description,label
0,fusion interest rebate,Interest Received
1,rtc express pmt to battlezone party lance 3 sep,Entertainment
2,pos purchase (effec ) djs biltong (pty) ltd no...,Fast Food & Takeouts
3,pos purchase [garage] mitchell park [date],Fuel
4,pos purchase [grocers] online [date],Groceries
5,pos purchase [garage] nkandla [date],General Purchases
6,sms notification fee branch:,Bank Charges and Fees
7,banking app payment,Bank Transfer
8,eft debit order payment (): cartrack,Transport
9,magtape debit mtn sp,Cellular Data Purchase


In [90]:
example_desc = df['description'].iloc[3]
example_desc

'pos purchase [garage] mitchell park  [date]'

In [91]:
df[df['label'].isin(['Fuel', 'Genereal Purchases', 'Eating Out'])][['description', 'label']].head(50)


Unnamed: 0,description,label
3,pos purchase [garage] mitchell park [date],Fuel
17,pos purchase mcd rivonia (53) [date],Eating Out
22,pos purchase [restaurant] - arc [date],Eating Out
37,pos purchase [restaurant] ultra south [date],Eating Out
38,pos purchase payfast *bagel [date],Eating Out
39,pos purchase [restaurant] [date],Eating Out
45,pos purchase yoco *boba spotea [date],Eating Out
50,pos purchase bryanston country c [date],Eating Out
59,pos purchase [restaurant] - arc [date],Eating Out
61,pos purchase [restaurant] *sykes slush [date],Eating Out


In [92]:
def add_location_token_if_eating_out(row):
    text = row['description']
    label = row['label']

    # Only apply if it's Eating Out and matches expected structure
    if label == 'Eating Out' and '[restaurant]' in text and '[date]' in text:
        # Replace anything between [restaurant] and [date] with [location]
        pattern = r"(\[restaurant\])\s+(.+?)\s+(\[date\])"
        text = re.sub(pattern, r"\1 [location] \3", text)

    return text

df['description'] = df.apply(add_location_token_if_eating_out, axis=1)


In [93]:
df[df['label'].isin(['Fuel', 'Genereal Purchases', 'Eating Out'])][['description', 'label']].head(50)

Unnamed: 0,description,label
3,pos purchase [garage] mitchell park [date],Fuel
17,pos purchase mcd rivonia (53) [date],Eating Out
22,pos purchase [restaurant] [location] [date],Eating Out
37,pos purchase [restaurant] [location] [date],Eating Out
38,pos purchase payfast *bagel [date],Eating Out
39,pos purchase [restaurant] [date],Eating Out
45,pos purchase yoco *boba spotea [date],Eating Out
50,pos purchase bryanston country c [date],Eating Out
59,pos purchase [restaurant] [location] [date],Eating Out
61,pos purchase [restaurant] [location] [date],Eating Out


### Step 4: Replacing Location Segments

We observed consistent patterns like:
- `[restaurant] craighall [date]`
- `[garage] mitchell park [date]`

This inspired me to extract location-like segments that appear between `[restaurant]`/`[garage]` and `[date]`, replacing them with `[location]`. This transformation standardizes branch names and further reduces feature sparsity.


In [98]:
def add_tokens(row):
    text = row['description']
    label = row['label']

    # --- Eating Out ---
    if label == 'Eating Out' and 'pos purchase' in text and '[date]' in text:

        # Case 1: Restaurant already present → insert location
        if '[restaurant]' in text:
            pattern = r"(\[restaurant\])\s+(.+?)\s+(\[date\])"
            text = re.sub(pattern, r"\1 [location] \3", text)

        # Case 2: No restaurant token yet → assume restaurant
        elif '[restaurant]' not in text:
            pattern = r"(pos purchase)\s+(.+?)\s+(\[date\])"
            text = re.sub(pattern, r"\1 [restaurant] \3", text)

    # --- Fuel ---
    elif label == 'Fuel' and '[garage]' in text and '[date]' in text:
        pattern = r"(\[garage\])\s+(.+?)\s+(\[date\])"
        text = re.sub(pattern, r"\1 [location] \3", text)

    return text


df['description'] = df.apply(add_tokens, axis=1)


In [101]:
df[df['label'].isin(['Fuel', 'General Purchases', 'Eating Out'])][['description', 'label']].head(50)

Unnamed: 0,description,label
3,pos purchase [garage] [location] [date],Fuel
5,pos purchase [garage] nkandla [date],General Purchases
13,pos purchase [garage] shelly conv c [date],General Purchases
17,pos purchase [restaurant] [date],Eating Out
22,pos purchase [restaurant] [location] [date],Eating Out
37,pos purchase [restaurant] [location] [date],Eating Out
38,pos purchase [restaurant] [date],Eating Out
39,pos purchase [restaurant] [date],Eating Out
45,pos purchase [restaurant] [date],Eating Out
50,pos purchase [restaurant] [date],Eating Out
