## Business case:- 
*   The goal of this assignment is to develop an AI-driven system that automates the 
categorization of bank transactions. The categorization should leverage historical data from past 
transactions and payee information. If historical data fails to suggest an appropriate category, 
the system should use OpenAI's APIs to generate an appropriate classification.

#### Domain :- 
* Finance

### Domain Analysis

In [133]:
import pandas as pd # use for data cleaning
import numpy as np # use for numerical function

In [134]:
# load the json file
data = pd.read_json("Bank Trasnactions.json") 
data

Unnamed: 0,Amount,Bank,Description,Date,Type
0,290.33,Chase Property #0184,AIRBNB PAYMENTS AM76GKUXMO PPD ID: XXXXXX1428,2024-06-21,Income
1,1000.00,ANZ_Construction Loan_6823,ANZ M-BANKING FUNDS TFER TRANSF...,2024-11-03,Expense
2,137.35,American Express #2003,UNIQLO USA LLC NEW YORK NY XXXX2003,2024-11-18,Expense
3,66.14,American Express #2003,LIDS HOLDINGS INC. INDIANAPOLIS IN XXXX2003,2024-11-18,Expense
4,73.02,American Express #2003,Interest Charge on Pay Over Time Purchases XXX...,2024-11-25,Expense
...,...,...,...,...,...
357,7.05,Bank Australia#0902,Visa Live Group/8/1 Progress Aeastwood AU#5977...,2024-02-27,Expense
358,7.06,Bank Australia#0902,Visa Lees Farm Eastwood AU#5977439 (Ref.022800...,2024-02-28,Expense
359,7.24,Bank Australia#0902,Visa -Tonkotsuya Carlingford AU#5977439 (Ref.0...,2024-02-02,Expense
360,7.50,Bank Australia#0902,Visa Abdulaziz Radi Shellharbour AU#5977439 (R...,2024-02-29,Expense


In [475]:
description_list1 = data["Description"].tolist()
description_list1

['AIRBNB PAYMENTS AM76GKUXMO PPD ID: XXXXXX1428',
 'ANZ M-BANKING FUNDS TFER                TRANSFER 642683  TO  XXXXXXXX6546568',
 'UNIQLO USA LLC NEW YORK NY XXXX2003',
 'LIDS HOLDINGS INC. INDIANAPOLIS IN XXXX2003',
 'Interest Charge on Pay Over Time Purchases XXXX2003',
 'Interest Charge on Pay Over Time Purchases XXXX2003',
 'HEINEMANN - 8734 ARRMASCOT AU XXXX2003',
 'DAVID JONES MACQUARISYDNEY AU XXXX2003',
 'D J*MW-ONLINE XXX-XXX-2378 NJ XXXX2003',
 'COMCAST CABLE COMM 800-COMCAST CO XXXX2003',
 'BING LEE ELECTRICS PSYDNEY AU XXXX2003',
 'ATP MEDIA WIMBLEDON GB XXXX2003',
 'AplPay MACQUARIE NORTH RYDE AU XXXX2003',
 'TINA MAIDS COLORADO SPRINGS CO XXXX2003',
 'AMAZON MARKETPLACE NAMZN.COM/BILL WA XXXX2011',
 'MACQUARIE NORTH RYDE AU XXXX2003',
 'PARAMOUNT+ WEST HOLLYWOO CA XXXX2003',
 'P.SKOOL.COM/CNMNT EL SEGUNDO CA XXXX2003',
 'MY SOLO 401K FINANCCARLSBAD CA XXXX2003',
 'MOUNTAIN VIEW ELECTRLIMON CO XXXX2003',
 'MOUNTAIN VIEW ELECTRLIMON CO XXXX2003',
 'MONARCH MONEY APP COVIN

**Attribute Inforamtion:-**

**Amount:** The numerical value associated with a transaction or an entry, typically representing the amount of money involved.

**Bank:** The name of the financial institution or bank associated with the transaction.

**Description:** A brief explanation or details about the transaction, such as what the payment was for (e.g., "Payment for services" or "Deposit").

**Date:** The date when the transaction occurred, often formatted as YYYY-MM-DD or MM/DD/YYYY, depending on the region.

**Type:** This likely indicates the nature of the transaction. It could be values such as:

In [274]:
# load the json file
data1 = pd.read_json("Charts of Accounts.json")
data1

Unnamed: 0,AccountType,Classification,FullyQualifiedName,Name
0,Credit Card,Liability,AAdvantage Aviator,AAdvantage Aviator
1,Accounts Payable,Liability,Accounts Payable,Accounts Payable
2,Bank,Asset,ANZ_Land Loan Offset_8143,ANZ_Land Loan Offset_8143
3,Income,Revenue,RENTS:Rent Income,Rent Income
4,Expense,Expense,Repair & Maintenance - Aud,Repair & Maintenance - Aud
...,...,...,...,...
141,Expense,Expense,Tax Payment,Tax Payment
142,Bank,Asset,Bank Australia 0054,Bank Australia 0054
143,Bank,Asset,Bank Australia 0903,Bank Australia 0903
144,Bank,Asset,Bank Australia 5523,Bank Australia 5523


In [538]:
data1["AccountType"].unique()

array(['Credit Card', 'Accounts Payable', 'Bank', 'Income', 'Expense',
       'Equity', 'Other Current Asset', 'Other Expense',
       'Accounts Receivable', 'Cost of Goods Sold', 'Other Income',
       'Other Asset', 'Other Current Liability'], dtype=object)

**Attribute Information:-**

**AccountType:** Defines the type of account (e.g., Checking, Savings).

**Classification:** Categorizes the account or transaction (e.g., Personal, Business).

**FullyQualifiedName:** Provides a precise and unique identifier for the account or entity.
    
**Name:** Represents the name of the account holder or entity.

In [120]:
# load the json file
data2 = pd.read_json("Payee.json")
data2

Unnamed: 0,Name,Account 1,Account 2,Account 3
0,SOUNDHOUND AI INC,Investment Accounts,,
1,Nextera Energy,Investment Accounts,Dividend,
2,Vanguard,Investment Accounts:Vanguard,,
3,Superprof,"Household:Health,Fitness,Personal",,
4,Haaretz,Dues and Subscriptions,,
...,...,...,...,...
501,Strike Bowling Castle,,,
502,Safety Wing,,,
503,Boost Juice,,,
504,Trybooking,,,


**Attribute Information:-**

**Name:** Represents the name of the account holder or the entity. This could be the name of an individual or business associated with the accounts.
    
**Account 1:** Could represent a checking account, savings account, or any other type of account.
    
**Account 2:** Another account, perhaps a different type or used for a different purpose (e.g., business or personal).

**Account 3:** Yet another account, which could be linked to a separate financial activity, like an investment account or a credit account.

In [122]:
data2.replace("", "Not Found", inplace=True) # use dor replace null value in not found

In [123]:
data2

Unnamed: 0,Name,Account 1,Account 2,Account 3
0,SOUNDHOUND AI INC,Investment Accounts,Not Found,Not Found
1,Nextera Energy,Investment Accounts,Dividend,Not Found
2,Vanguard,Investment Accounts:Vanguard,Not Found,Not Found
3,Superprof,"Household:Health,Fitness,Personal",Not Found,Not Found
4,Haaretz,Dues and Subscriptions,Not Found,Not Found
...,...,...,...,...
501,Strike Bowling Castle,Not Found,Not Found,Not Found
502,Safety Wing,Not Found,Not Found,Not Found
503,Boost Juice,Not Found,Not Found,Not Found
504,Trybooking,Not Found,Not Found,Not Found


## Feature Engineering

* Feature Engineering is the process of using domain knowledge, creativity, and data manipulation techniques to transform raw data into features (or variables) that can improve the performance of machine learning models.

* In simple terms, it involves:

1. Creating new features from existing data (e.g., combining two columns, extracting parts of a string, creating time-based features like day of the week).
2. Transforming features to better represent the underlying patterns (e.g., normalizing, scaling, or encoding categorical variables).
3. Selecting important features to improve model accuracy and reduce overfitting.

## Data Preprocessing

* Data Preprocessing is the process of preparing raw data for analysis or machine learning by cleaning, transforming, and organizing it in a way that enhances the accuracy and efficiency of models. It typically involves the following steps:

In [207]:
# merge the data
import pandas as pd

# Load datasets
df1 = pd.read_json("Charts of Accounts.json")  # First dataset
df2 = pd.read_json("Payee.json")  # Second dataset

# Perform row-wise merge (only matching rows) based on a common column (e.g., "ID")
merged_df = pd.merge(df1, df2, on="Name", how="outer")  # Inner join keeps only matches

# Save the merged file
merged_df.to_csv("Merged_Dataset.csv", index=False)

print("✅ Merging complete! Only matching rows are retained.")


✅ Merging complete! Only matching rows are retained.


In [208]:
dff = pd.read_csv("Merged_Dataset.csv") # load the csv file
dff

Unnamed: 0,AccountType,Classification,FullyQualifiedName,Name,Account 1,Account 2,Account 3
0,Credit Card,Liability,AAdvantage Aviator,AAdvantage Aviator,,,
1,Accounts Payable,Liability,Accounts Payable,Accounts Payable,,,
2,Bank,Asset,ANZ_Land Loan Offset_8143,ANZ_Land Loan Offset_8143,,,
3,Income,Revenue,RENTS:Rent Income,Rent Income,,,
4,Expense,Expense,Repair & Maintenance - Aud,Repair & Maintenance - Aud,,,
...,...,...,...,...,...,...,...
646,,,,Strike Bowling Castle,,,
647,,,,Safety Wing,,,
648,,,,Boost Juice,,,
649,,,,Trybooking,,,


In [226]:
# matched_name or Account match value based on Description 
import pandas as pd

# Load Bank Transactions (JSON)
data = pd.read_json("Bank Trasnactions.json")

# Load Payee Data (CSV) - Make sure it contains "Name" and "Account 1"
data2 = pd.read_csv("Payee.csv")  

# Handle missing values
data["Description"] = data["Description"].fillna("Not Found")  # Prevent NaN issues
data2["Name"] = data2["Name"].fillna("Not Found")  
data2["Account 1"] = data2["Account 1"].fillna("Not Found")  # Ensure Account 1 has values

# Convert Name column into a dictionary {Name: Account 1} for fast lookup
name_to_account = dict(zip(data2["Name"], data2["Account 1"]))

# Function to find a matching Name and return Account 1
def find_match_and_account(description, reference_dict):
    for name, account in reference_dict.items():
        if name.lower() in description.lower():  # Case insensitive matching
            return name, account  # Return matched name and its corresponding Account 1
    return "No Found", "Not Found"  # Default if no match is found

# Apply function to find matches
data[["Matched_Name", "Account 1"]] = data["Description"].apply(
    lambda x: pd.Series(find_match_and_account(x, name_to_account))
)

# Display the result
print(data)

# Save results to a new file
data.to_csv("Categorized_Transactions.csv", index=False)
print("✅ Process completed! Results saved to Categorized_Transactions.csv")

      Amount                        Bank  \
0     290.33        Chase Property #0184   
1    1000.00  ANZ_Construction Loan_6823   
2     137.35      American Express #2003   
3      66.14      American Express #2003   
4      73.02      American Express #2003   
..       ...                         ...   
357     7.05         Bank Australia#0902   
358     7.06         Bank Australia#0902   
359     7.24         Bank Australia#0902   
360     7.50         Bank Australia#0902   
361   496.15         Bank Australia#0902   

                                           Description       Date     Type  \
0        AIRBNB PAYMENTS AM76GKUXMO PPD ID: XXXXXX1428 2024-06-21   Income   
1    ANZ M-BANKING FUNDS TFER                TRANSF... 2024-11-03  Expense   
2                  UNIQLO USA LLC NEW YORK NY XXXX2003 2024-11-18  Expense   
3          LIDS HOLDINGS INC. INDIANAPOLIS IN XXXX2003 2024-11-18  Expense   
4    Interest Charge on Pay Over Time Purchases XXX... 2024-11-25  Expense   
.. 

In [290]:
df = pd.read_csv("Categorized_Transactions.csv") # load the CSV File
df.iloc[:,2:]

Unnamed: 0,Description,Date,Type,Matched_Name,Account 1
0,AIRBNB PAYMENTS AM76GKUXMO PPD ID: XXXXXX1428,2024-06-21,Income,AirBnB,Rental Income
1,ANZ M-BANKING FUNDS TFER TRANSF...,2024-11-03,Expense,No Found,Not Found
2,UNIQLO USA LLC NEW YORK NY XXXX2003,2024-11-18,Expense,Uniqlo,"Household:Health,Fitness,Personal"
3,LIDS HOLDINGS INC. INDIANAPOLIS IN XXXX2003,2024-11-18,Expense,LIDS,"Household:Health,Fitness,Personal"
4,Interest Charge on Pay Over Time Purchases XXX...,2024-11-25,Expense,No Found,Not Found
...,...,...,...,...,...
357,Visa Live Group/8/1 Progress Aeastwood AU#5977...,2024-02-27,Expense,Apple,Not Found
358,Visa Lees Farm Eastwood AU#5977439 (Ref.022800...,2024-02-28,Expense,Apple,Not Found
359,Visa -Tonkotsuya Carlingford AU#5977439 (Ref.0...,2024-02-02,Expense,Tonkotsuya,Meals and Entertainment
360,Visa Abdulaziz Radi Shellharbour AU#5977439 (R...,2024-02-29,Expense,Apple,Not Found


In [286]:
# drop the invalid Column
dff2 = df.drop(["Amount","Bank","Date","Type"],axis=1)
dff2
cv =  dff2.to_csv("Hello.csv")
cv

In [287]:
cv1 = pd.read_csv("Hello.csv") # load the csv file
cv1

Unnamed: 0.1,Unnamed: 0,Description,Matched_Name,Account 1
0,0,AIRBNB PAYMENTS AM76GKUXMO PPD ID: XXXXXX1428,AirBnB,Rental Income
1,1,ANZ M-BANKING FUNDS TFER TRANSF...,No Found,Not Found
2,2,UNIQLO USA LLC NEW YORK NY XXXX2003,Uniqlo,"Household:Health,Fitness,Personal"
3,3,LIDS HOLDINGS INC. INDIANAPOLIS IN XXXX2003,LIDS,"Household:Health,Fitness,Personal"
4,4,Interest Charge on Pay Over Time Purchases XXX...,No Found,Not Found
...,...,...,...,...
357,357,Visa Live Group/8/1 Progress Aeastwood AU#5977...,Apple,Not Found
358,358,Visa Lees Farm Eastwood AU#5977439 (Ref.022800...,Apple,Not Found
359,359,Visa -Tonkotsuya Carlingford AU#5977439 (Ref.0...,Tonkotsuya,Meals and Entertainment
360,360,Visa Abdulaziz Radi Shellharbour AU#5977439 (R...,Apple,Not Found


## Model Selection

* Model Selection is the process of choosing the best machine learning model from a set of candidate models for a given task, based on criteria like performance, efficiency, and suitability for the problem. This step is critical because the choice of model directly affects the quality of predictions and the accuracy of your results.

## RandomForest Classification Algorithm 

* Random Forest Classification is a powerful ensemble learning algorithm used for solving classification problems. It works by constructing a collection of decision trees during the training phase, where each tree is trained on a random subset of the data, using a technique called bootstrapping. This approach ensures that each tree is exposed to slightly different data, promoting diversity among the trees and preventing overfitting. Additionally, at each split in the trees, a random subset of features is selected, further enhancing the model's robustness. When making predictions, the model aggregates the outputs of all trees through majority voting, where the class predicted by most trees is chosen as the final prediction.
  
* Random Forest is known for its high accuracy, ability to handle large datasets with many features, and its resilience to overfitting, making it suitable for a wide range of classification tasks, such as medical diagnosis, customer segmentation, and image classification. However, while it offers high performance, it can be computationally expensive and difficult to interpret due to the complexity of combining multiple decision trees.

In [246]:
import pandas as pd # use for Data manipalution
import re # use for reguler exprression
from sklearn.model_selection import train_test_split # use for train and test data
from sklearn.feature_extraction.text import TfidfVectorizer # use for categorical to numerical vectore
from sklearn.ensemble import RandomForestClassifier # import RM Algorithm
from sklearn.pipeline import Pipeline # create a custom pipline 
from sklearn.metrics import accuracy_score # use for model evaluation

# Load dataset (replace with actual file)
file_path = "Categorized_Transactions.csv"
data = pd.read_csv(file_path)

# Ensure required columns exist
required_columns = ["Description", "Matched_Name", "Account 1"]
for col in required_columns:
    if col not in data.columns:
        raise ValueError(f"Missing required column: {col} in the dataset")

# Fill missing values
data["Description"] = data["Description"].fillna("Unknown")
data["Matched_Name"] = data["Matched_Name"].fillna("Not Found")
data["Account 1"] = data["Account 1"].fillna("Not Found")

# Clean text (Remove numbers, special characters)
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.strip()

data["Cleaned_Description"] = data["Description"].apply(clean_text)

# Split dataset
X_train, X_test, y_train_name, y_test_name = train_test_split(data["Cleaned_Description"], data["Matched_Name"], test_size=0.2, random_state=42)
X_train, X_test, y_train_acc, y_test_acc = train_test_split(data["Cleaned_Description"], data["Account 1"], test_size=0.2, random_state=42)

# Create TF-IDF Vectorizer + Random Forest Pipeline
model_name = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2))),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

model_acc = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2))),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train models
print("Training 'Matched_Name' model...")
model_name.fit(X_train, y_train_name)

print("Training 'Account 1' model...")
model_acc.fit(X_train, y_train_acc)

# Predict on test set
y_pred_name = model_name.predict(X_test)
y_pred_acc = model_acc.predict(X_test)

# Evaluate accuracy
print(f"'Matched_Name' Accuracy: {accuracy_score(y_test_name, y_pred_name):.2f}")
print(f"'Account 1' Accuracy: {accuracy_score(y_test_acc, y_pred_acc):.2f}")

# Function to predict new descriptions
def classify_transaction(description):
    cleaned_desc = clean_text(description)
    matched_name_pred = model_name.predict([cleaned_desc])[0]
    account_pred = model_acc.predict([cleaned_desc])[0]
    return matched_name_pred, account_pred

# Example Usage
description = "UNIQLO USA LLC NEW YORK NY XXXX2003"
matched_name, account_category = classify_transaction(description)
print(f"Predicted Matched_Name: {matched_name}")
print(f"Predicted Account 1: {account_category}")

Training 'Matched_Name' model...
Training 'Account 1' model...
'Matched_Name' Accuracy: 0.77
'Account 1' Accuracy: 0.81
Predicted Matched_Name: Uniqlo
Predicted Account 1: Household:Health,Fitness,Personal


**Analysis Model:-**
* The output of the Random Forest model shows the training progress for two different tasks: predicting the Matched_Name and Account 1. The Matched_Name model achieved an accuracy of 77%, indicating that it correctly predicted the matched name in 77% of the cases. This suggests that the model is performing reasonably well but may have some room for improvement, especially in handling edge cases or rare examples. On the other hand, the Account 1 model performed slightly better with an accuracy of 81%, meaning it correctly classified the account in 81% of the cases. This indicates that the Account 1 model is likely handling the classification task more effectively, possibly due to clearer data patterns or more relevant features. However, in both models, there may still be opportunities for fine-tuning or optimizing features to further enhance performance.

## Named entity recognition Algorithm :- NER

* Named Entity Recognition (NER) is a natural language processing technique that identifies and classifies entities in text, such as names of people, organizations, locations, dates, and other specific terms. It helps extract structured information from unstructured text. NER models, often based on machine learning or deep learning algorithms, analyze the context of words to accurately detect entities. This technique is widely used in applications like information retrieval, document classification, and sentiment analysis, where understanding key entities is essential for processing and categorizing data.

In [267]:
import pandas as pd # use for data manipalution
import re # use for reguler expression
import spacy # use for NER 
from sklearn.model_selection import train_test_split # use for train and test data check
from sklearn.feature_extraction.text import TfidfVectorizer # use for categorical to numerical
from sklearn.ensemble import RandomForestClassifier # import RM model
from sklearn.pipeline import Pipeline # create custome pipline 
from sklearn.metrics import accuracy_score # check the model evaluation

# Load spaCy's pre-trained NER model
nlp = spacy.load("en_core_web_sm")

# Load dataset (replace with actual file)
file_path = "Categorized_Transactions.csv"
data = pd.read_csv(file_path)

# Ensure required columns exist
required_columns = ["Description", "Matched_Name", "Account 1"]
for col in required_columns:
    if col not in data.columns:
        raise ValueError(f"Missing required column: {col} in the dataset")

# Fill missing values
data["Description"] = data["Description"].fillna("Unknown")
data["Matched_Name"] = data["Matched_Name"].fillna("Not Found")
data["Account 1"] = data["Account 1"].fillna("Not Found")

# Clean text (Remove numbers, special characters)
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.strip()

# Function to extract named entities from the description using spaCy
def extract_entities(text):
    doc = nlp(text)  # Process the text with spaCy's NER model
    entities = []
    for ent in doc.ents:
        entities.append(ent.text)  # Collect the text of each named entity
    return " ".join(entities)  # Combine entities into one string for feature extraction

data["Cleaned_Description"] = data["Description"].apply(clean_text)
data["Extracted_Entities"] = data["Cleaned_Description"].apply(extract_entities)

# Use either the original description or the extracted entities
# Here, we are using the extracted entities as a feature for classification
X_train, X_test, y_train_name, y_test_name = train_test_split(data["Extracted_Entities"], data["Matched_Name"], test_size=0.2, random_state=42)
X_train, X_test, y_train_acc, y_test_acc = train_test_split(data["Extracted_Entities"], data["Account 1"], test_size=0.2, random_state=42)

# Create TF-IDF Vectorizer + Random Forest Pipeline for Matched_Name and Account 1
model_name = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2))),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

model_acc = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2))),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train models
print("Training 'Matched_Name' model...")
model_name.fit(X_train, y_train_name)

print("Training 'Account 1' model...")
model_acc.fit(X_train, y_train_acc)

# Predict on test set
y_pred_name = model_name.predict(X_test)
y_pred_acc = model_acc.predict(X_test)

# Evaluate accuracy
print(f"'Matched_Name' Accuracy: {accuracy_score(y_test_name, y_pred_name):.2f}")
print(f"'Account 1' Accuracy: {accuracy_score(y_test_acc, y_pred_acc):.2f}")

# Function to predict new descriptions
def classify_transaction(description):
    cleaned_desc = clean_text(description)
    # Extract named entities from the cleaned description using the NER model
    entities = extract_entities(cleaned_desc)
    matched_name_pred = model_name.predict([entities])[0]
    account_pred = model_acc.predict([entities])[0]
    return matched_name_pred, account_pred

# Example Usage
description = "ABC copmany buy the 10 cars"
matched_name, account_category = classify_transaction(description)
print(f"Predicted Matched_Name: {matched_name}")
print(f"Predicted Account 1: {account_category}")


Training 'Matched_Name' model...
Training 'Account 1' model...
'Matched_Name' Accuracy: 0.81
'Account 1' Accuracy: 0.89
Predicted Matched_Name: No Found
Predicted Account 1: Not Found


**Analysis Model:-**
* he output of the NER model shows the training results for two tasks: predicting the Matched_Name and Account 1. The Matched_Name model achieved an accuracy of 81%, indicating that it correctly identified the matched names in 81% of the cases during training. However, the prediction for the Matched_Name task was "No Found," suggesting that, in this instance, the model did not find a match in the given input text or that no relevant entity was detected. On the other hand, the Account 1 model performed better with an accuracy of 89%, meaning it correctly predicted the account category in 89% of the cases. However, the prediction for Account 1 was "Not Found," indicating that the model did not identify a relevant account category in this case. While both models show relatively high accuracy, the "Not Found" and "No Found" predictions suggest that the input data might not contain recognizable entities or the models need further refinement or more context to improve entity detection in specific cases.

## LSTM :- Long Short Tearm Memory

* Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) designed to overcome the vanishing gradient problem, allowing it to capture long-term dependencies in sequential data. LSTMs use special units, called memory cells, to maintain information over long sequences by regulating the flow of information through gates (input, forget, and output gates). This ability makes LSTM models highly effective for tasks involving time series data, speech recognition, language modeling, and machine translation. They are particularly useful when the model needs to remember important context from earlier in the sequence to make accurate predictions later on.

In [273]:
import pandas as pd # use for data manipalution
import numpy as np # use for numrical function
import re # use for reguklar expression
import tensorflow as tf # use tensorflow library for deep lerning
from tensorflow.keras.preprocessing.text import Tokenizer # use for text classification probelm
from tensorflow.keras.preprocessing.sequence import pad_sequences # use for deep learning
from tensorflow.keras.models import Sequential # use sequenction model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout # call LSTM mdoel
from tensorflow.keras.callbacks import EarlyStopping # Use for performance
from sklearn.preprocessing import LabelEncoder # use for categorical to numerical 
from sklearn.model_selection import train_test_split # use for tarin and test data
from sklearn.metrics import classification_report # use for model evaluation

# Load dataset
file_path = "Categorized_Transactions.csv"
data = pd.read_csv(file_path)

# Ensure required columns exist
required_columns = ["Description", "Matched_Name", "Account 1"]
for col in required_columns:
    if col not in data.columns:
        raise ValueError(f"Missing required column: {col} in the dataset")

# Fill missing values
data["Description"] = data["Description"].fillna("Unknown")
data["Matched_Name"] = data["Matched_Name"].fillna("Not Found")
data["Account 1"] = data["Account 1"].fillna("Not Found")

# Clean text function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
    return text.strip()

data["Cleaned_Description"] = data["Description"].apply(clean_text)

# Tokenize text with OOV handling
MAX_VOCAB_SIZE = 5000  # Limit vocabulary size for better generalization
OOV_TOKEN = "<OOV>"  # Token for out-of-vocabulary words

tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token=OOV_TOKEN)
tokenizer.fit_on_texts(data["Cleaned_Description"])

# Convert text to sequences
X = tokenizer.texts_to_sequences(data["Cleaned_Description"])
X = pad_sequences(X, padding='post')

# Encode target variables
label_enc_name = LabelEncoder()
data["Matched_Name_Encoded"] = label_enc_name.fit_transform(data["Matched_Name"])

label_enc_acc = LabelEncoder()
data["Account_Encoded"] = label_enc_acc.fit_transform(data["Account 1"])

# Train-test split
X_train, X_test, y_train_name, y_test_name = train_test_split(X, data["Matched_Name_Encoded"], test_size=0.2, random_state=42)
X_train, X_test, y_train_acc, y_test_acc = train_test_split(X, data["Account_Encoded"], test_size=0.2, random_state=42)

# LSTM Model with pre-trained FastText embeddings (enhanced architecture)
def build_lstm_model(output_dim, vocab_size, input_length):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=50, input_length=input_length, mask_zero=True),
        LSTM(200, return_sequences=False, dropout=0.2, recurrent_dropout=0.2),
        Dense(50, activation='relu'),
        Dense(output_dim, activation='softmax')  # Softmax for multi-class classification
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Build models
vocab_size = min(len(tokenizer.word_index) + 1, MAX_VOCAB_SIZE)
input_length = X.shape[1]

model_name = build_lstm_model(len(label_enc_name.classes_), vocab_size, input_length)
model_acc = build_lstm_model(len(label_enc_acc.classes_), vocab_size, input_length)

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train 'Matched_Name' model
print("Training 'Matched_Name' model...")
model_name.fit(X_train, y_train_name, epochs=100, batch_size=32, validation_data=(X_test, y_test_name), callbacks=[early_stopping])

# Train 'Account 1' model
print("Training 'Account 1' model...")
model_acc.fit(X_train, y_train_acc, epochs=100, batch_size=32, validation_data=(X_test, y_test_acc), callbacks=[early_stopping])

# Save models
model_name.save("lstm_matched_name.h5")
model_acc.save("lstm_account.h5")

print("Models saved successfully!")

# Prediction Function with OOV Handling
def classify_transaction(description, threshold=50, top_n=3):
    cleaned_desc = clean_text(description)
    seq = tokenizer.texts_to_sequences([cleaned_desc])
    padded_seq = pad_sequences(seq, maxlen=X.shape[1])

    # Predict categories
    pred_name = model_name.predict(padded_seq)[0]
    pred_acc = model_acc.predict(padded_seq)[0]

    # Get top N predictions
    top_name_indices = np.argsort(pred_name)[-top_n:][::-1]
    top_acc_indices = np.argsort(pred_acc)[-top_n:][::-1]

    top_names = [(label_enc_name.inverse_transform([idx])[0], pred_name[idx] * 100) for idx in top_name_indices]
    top_accounts = [(label_enc_acc.inverse_transform([idx])[0], pred_acc[idx] * 100) for idx in top_acc_indices]

    # If top prediction confidence is below threshold, return "Not Found"
    best_matched_name, best_matched_conf = top_names[0]
    if best_matched_conf < threshold:
        best_matched_name = "Not Found"

    best_account, best_account_conf = top_accounts[0]
    if best_account_conf < threshold:
        best_account = "Not Found"

    return best_matched_name, best_matched_conf, best_account, best_account_conf, top_names, top_accounts

# Example Usage
description = "Visa -Tonkotsuya Carlingford AU#5977439"
matched_name, matched_name_conf, account_category, account_conf, top_names, top_accounts = classify_transaction(description)

# Print results
print(f"Predicted Matched_Name: {matched_name} ({matched_name_conf:.2f}%)")
print(f"Predicted Account 1: {account_category} ({account_conf:.2f}%)")

print("\nTop Matched Names:")
for name, conf in top_names:
    print(f"- {name} ({conf:.2f}%)")

print("\nTop Account Types:")
for acc, conf in top_accounts:
    print(f"- {acc} ({conf:.2f}%)")

# Evaluate models
y_pred_name = model_name.predict(X_test)
y_pred_acc = model_acc.predict(X_test)

# Convert predictions back to labels
y_pred_name_labels = np.argmax(y_pred_name, axis=1)
y_pred_acc_labels = np.argmax(y_pred_acc, axis=1)

print("\nClassification Report for 'Matched_Name':")
print(classification_report(y_test_name, y_pred_name_labels))

print("\nClassification Report for 'Account 1':")
print(classification_report(y_test_acc, y_pred_acc_labels))

Training 'Matched_Name' model...
Epoch 1/100




[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 104ms/step - accuracy: 0.3732 - loss: 3.4484 - val_accuracy: 0.3973 - val_loss: 3.0392
Epoch 2/100
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step - accuracy: 0.3436 - loss: 2.7560 - val_accuracy: 0.3973 - val_loss: 2.0517
Epoch 3/100
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - accuracy: 0.3432 - loss: 2.1865 - val_accuracy: 0.6301 - val_loss: 1.8901
Epoch 4/100
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - accuracy: 0.6537 - loss: 1.8368 - val_accuracy: 0.6301 - val_loss: 1.6999
Epoch 5/100
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - accuracy: 0.6092 - loss: 1.7218 - val_accuracy: 0.6575 - val_loss: 1.5537
Epoch 6/100
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - accuracy: 0.6886 - loss: 1.4087 - val_accuracy: 0.6849 - val_loss: 1.4237
Epoch 7/100
[1m10/10[0m [32m━━━━━━━━



Models saved successfully!
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 332ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 326ms/step
Predicted Matched_Name: Not Found (4.23%)
Predicted Account 1: Not Found (12.62%)

Top Matched Names:
- Amazon (4.23%)
- No Found (4.17%)
- Coles (3.88%)

Top Account Types:
- Office Supplies (12.62%)
- Meals and Entertainment (10.36%)
- Household:Health,Fitness,Personal (10.08%)
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 13ms/step 
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 13ms/step 

Classification Report for 'Matched_Name':
              precision    recall  f1-score   support

           2       0.55      1.00      0.71         6
           3       1.00      0.93      0.96        29
           4       0.00      0.00      0.00         1
           7       0.80      1.00      0.89         4
           8       0.00      0.00      0.00         1
           9       0.00      0.0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Random Data Genrate 

* To generate random data, we can create a dataset with synthetic values that resemble real-world data but aren't based on any actual sources. For instance, a dataset might include randomly generated values such as names, ages, cities, and product prices. Using libraries like NumPy or Pandas in Python, we can generate random numbers, select values from predefined lists, and assign random dates or categorical labels. For example, a random dataset might consist of 100 rows, where each row contains a name (e.g., "John Doe"), an age (e.g., 25), a city (e.g., "New York"), and a price (e.g., $150). Random data generation is commonly used for testing, simulating scenarios, and building models when real data isn't available or when testing the performance of algorithms on synthetic input.

In [291]:
import csv
import random

# Define a function to generate descriptions based on account and match name
def generate_description(match_name, account):
    description_templates = [
        f"{match_name} uses {account} to manage its operations efficiently.",
        f"The {account} account is a vital part of {match_name}'s financial strategy.",
        f"{match_name} utilizes {account} for handling its financial transactions and investments.",
        f"{match_name} processes payments through {account} for its ongoing projects.",
        f"The {account} account reflects payments and expenditures related to {match_name}'s services.",
        f"{match_name} settles all vendor payments through {account}, ensuring timely transaction processing.",
        f"By leveraging {account}, {match_name} ensures smooth cash flow for its business operations."
    ]
    
    return random.choice(description_templates)

# Define a list of match names and accounts
match_names = ["SOUNDHOUND AI INC", "Nextera Energy", "Vanguard", "Superprof", "Haaretz", "Myer-AUD", "BANNERMAN ENERGY LTD", "Farha Roofing", "QuickBooks Payments"]
accounts = ["AAdvantage Aviator", "Accounts Payable", "ANZ_Land Loan Offset_8143", "Rent Income", "Repair & Maintenance - AUD", "Repairs", "Appliances", "Door Lock & Keys", "Door Lock, Key & Materials"]

# Generate 1500 rows of data
data = [["Match Name", "Account", "Description"]]  # Header row

for _ in range(1500):
    match_name = random.choice(match_names)
    account = random.choice(accounts)
    description = generate_description(match_name, account)
    data.append([match_name, account, description])

# Specify the CSV filename
filename = "match_accounts_descriptions_1500.csv"

# Writing to CSV
with open(filename, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

print(f"CSV file '{filename}' has been created with 1500 rows.")


CSV file 'match_accounts_descriptions_1500.csv' has been created with 1500 rows.


In [294]:
g1 = pd.read_csv("match_accounts_descriptions_1500.csv") # load the CSV File
g1

Unnamed: 0,Match Name,Account,Description
0,QuickBooks Payments,Rent Income,QuickBooks Payments utilizes Rent Income for h...
1,Nextera Energy,Door Lock & Keys,"By leveraging Door Lock & Keys, Nextera Energy..."
2,SOUNDHOUND AI INC,Repairs,"By leveraging Repairs, SOUNDHOUND AI INC ensur..."
3,Myer-AUD,AAdvantage Aviator,Myer-AUD uses AAdvantage Aviator to manage its...
4,Myer-AUD,Repair & Maintenance - AUD,The Repair & Maintenance - AUD account reflect...
...,...,...,...
1495,Vanguard,Appliances,Vanguard uses Appliances to manage its operati...
1496,Myer-AUD,"Door Lock, Key & Materials","By leveraging Door Lock, Key & Materials, Myer..."
1497,Superprof,Repairs,The Repairs account reflects payments and expe...
1498,Superprof,AAdvantage Aviator,The AAdvantage Aviator account is a vital part...


In [302]:
g2 = g1.rename(columns={"Match Name":"Matched_Name","Account":"Account 1"}) # use for rename the column anme
g2

Unnamed: 0,Matched_Name,Account 1,Description
0,QuickBooks Payments,Rent Income,QuickBooks Payments utilizes Rent Income for h...
1,Nextera Energy,Door Lock & Keys,"By leveraging Door Lock & Keys, Nextera Energy..."
2,SOUNDHOUND AI INC,Repairs,"By leveraging Repairs, SOUNDHOUND AI INC ensur..."
3,Myer-AUD,AAdvantage Aviator,Myer-AUD uses AAdvantage Aviator to manage its...
4,Myer-AUD,Repair & Maintenance - AUD,The Repair & Maintenance - AUD account reflect...
...,...,...,...
1495,Vanguard,Appliances,Vanguard uses Appliances to manage its operati...
1496,Myer-AUD,"Door Lock, Key & Materials","By leveraging Door Lock, Key & Materials, Myer..."
1497,Superprof,Repairs,The Repairs account reflects payments and expe...
1498,Superprof,AAdvantage Aviator,The AAdvantage Aviator account is a vital part...


In [293]:
df4a

Unnamed: 0,Amount,Bank,Description,Date,Type,Matched_Name,Account 1
0,290.33,Chase Property #0184,AIRBNB PAYMENTS AM76GKUXMO PPD ID: XXXXXX1428,2024-06-21,Income,AirBnB,Rental Income
1,1000.00,ANZ_Construction Loan_6823,ANZ M-BANKING FUNDS TFER TRANSF...,2024-11-03,Expense,No Found,Not Found
2,137.35,American Express #2003,UNIQLO USA LLC NEW YORK NY XXXX2003,2024-11-18,Expense,Uniqlo,"Household:Health,Fitness,Personal"
3,66.14,American Express #2003,LIDS HOLDINGS INC. INDIANAPOLIS IN XXXX2003,2024-11-18,Expense,LIDS,"Household:Health,Fitness,Personal"
4,73.02,American Express #2003,Interest Charge on Pay Over Time Purchases XXX...,2024-11-25,Expense,No Found,Not Found
...,...,...,...,...,...,...,...
357,7.05,Bank Australia#0902,Visa Live Group/8/1 Progress Aeastwood AU#5977...,2024-02-27,Expense,Apple,Not Found
358,7.06,Bank Australia#0902,Visa Lees Farm Eastwood AU#5977439 (Ref.022800...,2024-02-28,Expense,Apple,Not Found
359,7.24,Bank Australia#0902,Visa -Tonkotsuya Carlingford AU#5977439 (Ref.0...,2024-02-02,Expense,Tonkotsuya,Meals and Entertainment
360,7.50,Bank Australia#0902,Visa Abdulaziz Radi Shellharbour AU#5977439 (R...,2024-02-29,Expense,Apple,Not Found


In [304]:
dfs = pd.concat([df4,g2],axis=0) # use for concate the two variable
dfs

Unnamed: 0,Amount,Bank,Description,Date,Type,Matched_Name,Account 1
0,290.33,Chase Property #0184,AIRBNB PAYMENTS AM76GKUXMO PPD ID: XXXXXX1428,2024-06-21,Income,AirBnB,Rental Income
1,1000.00,ANZ_Construction Loan_6823,ANZ M-BANKING FUNDS TFER TRANSF...,2024-11-03,Expense,No Found,Not Found
2,137.35,American Express #2003,UNIQLO USA LLC NEW YORK NY XXXX2003,2024-11-18,Expense,Uniqlo,"Household:Health,Fitness,Personal"
3,66.14,American Express #2003,LIDS HOLDINGS INC. INDIANAPOLIS IN XXXX2003,2024-11-18,Expense,LIDS,"Household:Health,Fitness,Personal"
4,73.02,American Express #2003,Interest Charge on Pay Over Time Purchases XXX...,2024-11-25,Expense,No Found,Not Found
...,...,...,...,...,...,...,...
1495,,,Vanguard uses Appliances to manage its operati...,,,Vanguard,Appliances
1496,,,"By leveraging Door Lock, Key & Materials, Myer...",,,Myer-AUD,"Door Lock, Key & Materials"
1497,,,The Repairs account reflects payments and expe...,,,Superprof,Repairs
1498,,,The AAdvantage Aviator account is a vital part...,,,Superprof,AAdvantage Aviator


In [305]:
dfs.columns # use for check the column name

Index(['Amount', 'Bank', 'Description', 'Date', 'Type', 'Matched_Name',
       'Account 1'],
      dtype='object')

In [400]:
dfd = dfs.drop(['Amount', 'Bank', 'Date', 'Type'],axis=1) #drop the invalid columns
dfd.iloc[1496][0]

  dfd.iloc[1496][0]


"The Repair & Maintenance - AUD account is a vital part of Vanguard's financial strategy."

In [311]:
cc = dfd.to_csv("ABC.csv") # data convert to csv file
cc

In [383]:
import pandas as pd # use for data manipalution
import numpy as np # use for numerical function
import re # use for regular expression
from sklearn.model_selection import train_test_split # use for train and test data
from sklearn.preprocessing import LabelEncoder # use for categoorical to numerical data
from tensorflow.keras.preprocessing.text import Tokenizer # use for word vise work
from tensorflow.keras.preprocessing.sequence import pad_sequences # use for data sequence vise work
from tensorflow.keras.models import Sequential, load_model # Import load_model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional # use lSTM model 
from tensorflow.keras.callbacks import EarlyStopping # reduce the overfiting problem solve
from tensorflow.keras.regularizers import l1_l2 # call the L1 or L2 model
from sklearn.metrics import classification_report # use for calssification probelm

# Load dataset
file_path = "ABC.csv"  # Replace with your actual file path
data = pd.read_csv(file_path)

# Data cleaning and preprocessing
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
    return text.strip()

data["Cleaned_Description"] = data["Description"].apply(clean_text)

# Tokenizer (increased vocabulary size)
MAX_VOCAB_SIZE = 10000
OOV_TOKEN = "<OOV>"
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token=OOV_TOKEN)
tokenizer.fit_on_texts(data["Cleaned_Description"])

# Label encoding
label_enc_name = LabelEncoder()
data["Matched_Name_Encoded"] = label_enc_name.fit_transform(data["Matched_Name"])

label_enc_acc = LabelEncoder()
data["Account_Encoded"] = label_enc_acc.fit_transform(data["Account 1"])

# Convert text to sequences
X = tokenizer.texts_to_sequences(data["Cleaned_Description"])
X = pad_sequences(X, padding='post', truncating='post')

# Train-test split
X_train, X_test, y_train_name, y_test_name = train_test_split(X, data["Matched_Name_Encoded"], test_size=0.2, random_state=42)
X_train, X_test, y_train_acc, y_test_acc = train_test_split(X, data["Account_Encoded"], test_size=0.2, random_state=42)

# Enhanced LSTM Model
def build_lstm_model(output_dim, vocab_size, input_length):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=100, input_length=input_length, mask_zero=True),
        Bidirectional(LSTM(128, return_sequences=True, dropout=0.3, recurrent_dropout=0.3)),
        Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.3)),
        Dense(64, activation='relu', kernel_regularizer=l1_l2(l1=1e-5, l2=1e-5)),
        Dropout(0.3),
        Dense(output_dim, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Build models
vocab_size = min(len(tokenizer.word_index) + 1, MAX_VOCAB_SIZE)
input_length = X.shape[1]

model_name = build_lstm_model(len(label_enc_name.classes_), vocab_size, input_length)
model_acc = build_lstm_model(len(label_enc_acc.classes_), vocab_size, input_length)

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train models
print("Training 'Matched_Name' model...")
model_name.fit(X_train, y_train_name, epochs=50, batch_size=32, validation_data=(X_test, y_test_name), callbacks=[early_stopping])

print("Training 'Account 1' model...")
model_acc.fit(X_train, y_train_acc, epochs=50, batch_size=32, validation_data=(X_test, y_test_acc), callbacks=[early_stopping])

# Save models
model_name.save("lstm_matched_name.h5")
model_acc.save("lstm_account.h5")

print("Models saved successfully!")


# --- Prediction Code ---

# Load models (in a separate script or after training)
model_name = load_model("lstm_matched_name.h5")
model_acc = load_model("lstm_account.h5")

# ... (Load data, clean_text function, tokenizer, label encoders - same as above)

# Get input length (from loaded model)
input_length = model_name.layers[0].input.shape[1]

# Prediction Function
def classify_transaction(description, threshold=50, top_n=3):
    cleaned_desc = clean_text(description)
    seq = tokenizer.texts_to_sequences([cleaned_desc])
    padded_seq = pad_sequences(seq, maxlen=input_length, padding='post')

    pred_name = model_name.predict(padded_seq)[0]
    pred_acc = model_acc.predict(padded_seq)[0]

    top_name_indices = np.argsort(pred_name)[-top_n:][::-1]
    top_acc_indices = np.argsort(pred_acc)[-top_n:][::-1]

    top_names = [(label_enc_name.inverse_transform([idx])[0], pred_name[idx] * 100) for idx in top_name_indices]
    top_accounts = [(label_enc_acc.inverse_transform([idx])[0], pred_acc[idx] * 100) for idx in top_acc_indices]

    best_matched_name, best_matched_conf = top_names[0]
    if best_matched_conf < threshold:
        best_matched_name = "Not Found"

    best_account, best_account_conf = top_accounts[0]
    if best_account_conf < threshold:
        best_account = "Not Found"

    return best_matched_name, best_matched_conf, best_account, best_account_conf, top_names, top_accounts


# Example Usage
description = "Payments related to Staff Salaries are processed through Vortex Technologies to maintain workforce satisfaction."
matched_name, matched_name_conf, account_category, account_conf, top_names, top_accounts = classify_transaction(description)

print(f"Predicted Matched_Name: {matched_name} ({matched_name_conf:.2f}%)")
print(f"Predicted Account 1: {account_category} ({account_conf:.2f}%)")

print("\nTop Matched Names:")
for name, conf in top_names:
    print(f"- {name} ({conf:.2f}%)")

print("\nTop Account Types:")
for acc, conf in top_accounts:
    print(f"- {acc} ({conf:.2f}%)")

# Evaluate models (on test set - after training)
y_pred_name = model_name.predict(X_test)
y_pred_acc = model_acc.predict(X_test)

y_pred_name_labels = np.argmax(y_pred_name, axis=1)
y_pred_acc_labels = np.argmax(y_pred_acc, axis=1)

print("\nClassification Report for 'Matched_Name':")
print(classification_report(y_test_name, y_pred_name_labels))

print("\nClassification Report for 'Account 1':")
print(classification_report(y_test_acc, y_pred_acc_labels))



Training 'Matched_Name' model...
Epoch 1/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m45s[0m 260ms/step - accuracy: 0.1047 - loss: 3.3880 - val_accuracy: 0.1635 - val_loss: 2.4537
Epoch 2/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 181ms/step - accuracy: 0.2597 - loss: 2.2951 - val_accuracy: 0.5737 - val_loss: 1.3380
Epoch 3/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 190ms/step - accuracy: 0.6212 - loss: 1.1944 - val_accuracy: 0.9196 - val_loss: 0.5250
Epoch 4/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 191ms/step - accuracy: 0.8568 - loss: 0.5708 - val_accuracy: 0.9598 - val_loss: 0.3037
Epoch 5/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 187ms/step - accuracy: 0.9419 - loss: 0.2956 - val_accuracy: 0.9437 - val_loss: 0.3318
Epoch 6/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 181ms/step - accuracy: 0.9269 - loss: 0.2573 - val_accuracy: 0.9625 - val_loss: 



Models saved successfully!




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step
Predicted Matched_Name: Nextera Energy (96.39%)
Predicted Account 1: Appliances (99.25%)

Top Matched Names:
- Nextera Energy (96.39%)
- Myer-AUD (1.86%)
- Haaretz (0.82%)

Top Account Types:
- Appliances (99.25%)
- Not Found (0.47%)
- Dues and Subscriptions (0.15%)
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 73ms/step
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 79ms/step

Classification Report for 'Matched_Name':
              precision    recall  f1-score   support

           2       1.00      1.00      1.00         7
           3       0.81      1.00      0.89        29
           4       1.00      1.00      1.00        28
           7       0.00      0.00      0.00         1
           8       0.91      1.00      0.95        10
          12       0.00      0.00      0.00         1
          13       1.00

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


**Analysis Model:-**
* The model's performance shows strong results, with high prediction accuracies for both Matched_Name and Account 1. The Matched_Name prediction for "Nextera Energy" is highly accurate at 96.39%, indicating that the model correctly identified the entity in most cases. Other potential matches include "Myer-AUD" (1.86%) and "Haaretz" (0.82%), though these have much lower probabilities, suggesting they are less likely to be correct. For Account 1, the model predicted "Appliances" with an impressive accuracy of 99.25%, demonstrating its strong capability in classifying account types. The other top predictions include "Not Found" (0.47%) and "Dues and Subscriptions" (0.15%), which show a very low likelihood of being correct. Overall, the model performs excellently in identifying and classifying the correct entities and account types, with very high confidence in its predictions. However, the low-probability alternatives indicate that there may still be edge cases where the model could benefit from further training or data enhancement.

# Convert output JSON formate

In [484]:
import pandas as pd # use for data manipalution
import numpy as np # use for numerical function
import re # use for regular expression
from sklearn.model_selection import train_test_split # use for train and test data
from sklearn.preprocessing import LabelEncoder # use for categoorical to numerical data
from tensorflow.keras.preprocessing.text import Tokenizer # use for word vise work
from tensorflow.keras.preprocessing.sequence import pad_sequences # use for data sequence vise work
from tensorflow.keras.models import Sequential, load_model # Import load_model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional # use lSTM model 
from tensorflow.keras.callbacks import EarlyStopping # reduce the overfiting problem solve
from tensorflow.keras.regularizers import l1_l2 # call the L1 or L2 model
from sklearn.metrics import classification_report # use for calssification probelm
import json # use for output convert json form data

# Load dataset
file_path = "ABC.csv"  # Replace with your actual file path
data = pd.read_csv(file_path)

# Data cleaning and preprocessing
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
    return text.strip()

data["Cleaned_Description"] = data["Description"].apply(clean_text)

# Tokenizer
MAX_VOCAB_SIZE = 10000
OOV_TOKEN = "<OOV>"
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token=OOV_TOKEN)
tokenizer.fit_on_texts(data["Cleaned_Description"])

# Label encoding
label_enc_name = LabelEncoder()
data["Matched_Name_Encoded"] = label_enc_name.fit_transform(data["Matched_Name"])

label_enc_acc = LabelEncoder()
data["Account_Encoded"] = label_enc_acc.fit_transform(data["Account 1"])

# Convert text to sequences
X = tokenizer.texts_to_sequences(data["Cleaned_Description"])
X = pad_sequences(X, padding='post', truncating='post')

# Train-test split
X_train, X_test, y_train_name, y_test_name = train_test_split(X, data["Matched_Name_Encoded"], test_size=0.2, random_state=42)
X_train, X_test, y_train_acc, y_test_acc = train_test_split(X, data["Account_Encoded"], test_size=0.2, random_state=42)

# Enhanced LSTM Model
def build_lstm_model(output_dim, vocab_size, input_length):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=100, input_length=input_length, mask_zero=True),
        Bidirectional(LSTM(128, return_sequences=True, dropout=0.3, recurrent_dropout=0.3)),
        Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.3)),
        Dense(64, activation='relu', kernel_regularizer=l1_l2(l1=1e-5, l2=1e-5)),
        Dropout(0.3),
        Dense(output_dim, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Build models
vocab_size = min(len(tokenizer.word_index) + 1, MAX_VOCAB_SIZE)
input_length = X.shape[1]

model_name = build_lstm_model(len(label_enc_name.classes_), vocab_size, input_length)
model_acc = build_lstm_model(len(label_enc_acc.classes_), vocab_size, input_length)

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train models
print("Training 'Matched_Name' model...")
model_name.fit(X_train, y_train_name, epochs=50, batch_size=32, validation_data=(X_test, y_test_name), callbacks=[early_stopping])

print("Training 'Account 1' model...")
model_acc.fit(X_train, y_train_acc, epochs=50, batch_size=32, validation_data=(X_test, y_test_acc), callbacks=[early_stopping])

# Save models
model_name.save("lstm_matched_name.h5")
model_acc.save("lstm_account.h5")

print("Models saved successfully!")


# --- Prediction Code ---

# Load models 
model_name = load_model("lstm_matched_name.h5")
model_acc = load_model("lstm_account.h5")

# Get input length (from loaded model)
input_length = model_name.layers[0].input.shape[1]


# Prediction Function (modified for JSON output)
def classify_transaction(description, threshold=50):
    cleaned_desc = clean_text(description)
    seq = tokenizer.texts_to_sequences([cleaned_desc])
    padded_seq = pad_sequences(seq, maxlen=input_length, padding='post')

    pred_name = model_name.predict(padded_seq)[0]
    pred_acc = model_acc.predict(padded_seq)[0]

    best_name_index = np.argmax(pred_name)
    best_acc_index = np.argmax(pred_acc)

    best_matched_name = label_enc_name.inverse_transform([best_name_index])[0] if pred_name.size > 0 else "Not Found"
    best_account = label_enc_acc.inverse_transform([best_acc_index])[0] if pred_acc.size > 0 else "Not Found"

    best_matched_conf = pred_name[best_name_index] * 100 if pred_name.size > 0 else 0.0
    best_account_conf = pred_acc[best_acc_index] * 100 if pred_acc.size > 0 else 0.0

    if best_matched_conf < threshold:
        best_matched_name = "Not Found"

    if best_account_conf < threshold:
        best_account = "Not Found"

    results = {
        "description": description,
        "predicted_matched_name": f"{best_matched_name} ({best_matched_conf:.2f}%)",
        "predicted_account": f"{best_account} ({best_account_conf:.2f}%)",
    }

    return results


# Example Usage and JSON Output
description = "TechCorp uses Electricity Payment for handling operational costs and expenses."
results = classify_transaction(description)

# Save to JSON file
with open("predictions.json", "w") as f:
    json.dump(results, f, indent=4)

print("Predictions saved to predictions.json")



# --- Evaluation (Optional - if you want to include it) ---
y_pred_name = model_name.predict(X_test)
y_pred_acc = model_acc.predict(X_test)

y_pred_name_labels = np.argmax(y_pred_name, axis=1)
y_pred_acc_labels = np.argmax(y_pred_acc, axis=1)

report_name = classification_report(y_test_name, y_pred_name_labels, output_dict=True)
report_acc = classification_report(y_test_acc, y_pred_acc_labels, output_dict=True)

report_json_name = json.dumps(report_name, indent=4)
report_json_acc = json.dumps(report_acc, indent=4)

print("\nClassification Report for 'Matched_Name' (JSON):")
print(report_json_name)

print("\nClassification Report for 'Account 1' (JSON):")
print(report_json_acc)



Training 'Matched_Name' model...
Epoch 1/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m50s[0m 259ms/step - accuracy: 0.1149 - loss: 3.4889 - val_accuracy: 0.2735 - val_loss: 2.4270
Epoch 2/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 210ms/step - accuracy: 0.2188 - loss: 2.3603 - val_accuracy: 0.5684 - val_loss: 1.5997
Epoch 3/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 199ms/step - accuracy: 0.4897 - loss: 1.4768 - val_accuracy: 0.8713 - val_loss: 0.6653
Epoch 4/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 214ms/step - accuracy: 0.7853 - loss: 0.7377 - val_accuracy: 0.9410 - val_loss: 0.3939
Epoch 5/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 218ms/step - accuracy: 0.9076 - loss: 0.3781 - val_accuracy: 0.9678 - val_loss: 0.2999
Epoch 6/50
[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 202ms/step - accuracy: 0.9599 - loss: 0.2057 - val_accuracy: 0.9651 - val_lo



Models saved successfully!




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 4s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3s/step
Predictions saved to predictions.json
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 54ms/step
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 54ms/step

Classification Report for 'Matched_Name' (JSON):
{
    "2": {
        "precision": 0.7777777777777778,
        "recall": 1.0,
        "f1-score": 0.875,
        "support": 7.0
    },
    "3": {
        "precision": 0.9354838709677419,
        "recall": 1.0,
        "f1-score": 0.9666666666666667,
        "support": 29.0
    },
    "4": {
        "precision": 1.0,
        "recall": 1.0,
        "f1-score": 1.0,
        "support": 28.0
    },
    "7": {
        "precision": 0.0,
        "recall": 0.0,
        "f1-score": 0.0,
        "support": 1.0
    },
    "8": {
        "precision": 0.7692307692307693,
        "recall": 1.0,
        "f1-score": 0.8695652173913043

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


**Analysis Model:-**
* The model's performance shows strong results, with high prediction accuracies for both Matched_Name and Account 1. The Matched_Name prediction for "Nextera Energy" is highly accurate at 96.39%, indicating that the model correctly identified the entity in most cases. Other potential matches include "Myer-AUD" (1.86%) and "Haaretz" (0.82%), though these have much lower probabilities, suggesting they are less likely to be correct. For Account 1, the model predicted "Appliances" with an impressive accuracy of 99.25%, demonstrating its strong capability in classifying account types. The other top predictions include "Not Found" (0.47%) and "Dues and Subscriptions" (0.15%), which show a very low likelihood of being correct. Overall, the model performs excellently in identifying and classifying the correct entities and account types, with very high confidence in its predictions. However, the low-probability alternatives indicate that there may still be edge cases where the model could benefit from further training or data enhancement.

In [None]:
for i in description_list1:
    results = classify_transaction(i)
    append_to_json(results)
    

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 83ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 83ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 125ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 119ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 123ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 73ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 182ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 121ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 151ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 75ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s

In [485]:
description_list1

['AIRBNB PAYMENTS AM76GKUXMO PPD ID: XXXXXX1428',
 'ANZ M-BANKING FUNDS TFER                TRANSFER 642683  TO  XXXXXXXX6546568',
 'UNIQLO USA LLC NEW YORK NY XXXX2003',
 'LIDS HOLDINGS INC. INDIANAPOLIS IN XXXX2003',
 'Interest Charge on Pay Over Time Purchases XXXX2003',
 'Interest Charge on Pay Over Time Purchases XXXX2003',
 'HEINEMANN - 8734 ARRMASCOT AU XXXX2003',
 'DAVID JONES MACQUARISYDNEY AU XXXX2003',
 'D J*MW-ONLINE XXX-XXX-2378 NJ XXXX2003',
 'COMCAST CABLE COMM 800-COMCAST CO XXXX2003',
 'BING LEE ELECTRICS PSYDNEY AU XXXX2003',
 'ATP MEDIA WIMBLEDON GB XXXX2003',
 'AplPay MACQUARIE NORTH RYDE AU XXXX2003',
 'TINA MAIDS COLORADO SPRINGS CO XXXX2003',
 'AMAZON MARKETPLACE NAMZN.COM/BILL WA XXXX2011',
 'MACQUARIE NORTH RYDE AU XXXX2003',
 'PARAMOUNT+ WEST HOLLYWOO CA XXXX2003',
 'P.SKOOL.COM/CNMNT EL SEGUNDO CA XXXX2003',
 'MY SOLO 401K FINANCCARLSBAD CA XXXX2003',
 'MOUNTAIN VIEW ELECTRLIMON CO XXXX2003',
 'MOUNTAIN VIEW ELECTRLIMON CO XXXX2003',
 'MONARCH MONEY APP COVIN

In [530]:
description = "The Repairs account is a vital part of Haaretz's financial strategy."
results = classify_transaction(description)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 87ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 107ms/step


In [531]:
results

{'description': "The Repairs account is a vital part of Haaretz's financial strategy.",
 'predicted_matched_name': 'Haaretz (99.89%)',
 'predicted_account': 'Repairs (99.97%)'}

In [534]:
# store the values
append_to_json(results)

# Store Output in JSON FIle

In [533]:
import json # call json model
import os # call os model

def append_to_json(new_result, filename='results.json'):
    """Appends a new result to a JSON file, handling file creation and list structure."""

    if not os.path.exists(filename):  # Check if the file exists
        with open(filename, 'w') as f:
            json.dump([new_result], f, indent=4)  # Create file with initial list

    else:
        with open(filename, 'r+') as f:  # Open in read and write mode ('r+')
            try:
                data = json.load(f)  # Load existing data
                if isinstance(data, list):  # Check if it's a list
                    data.append(new_result)  # Append the new result
                    f.seek(0)  # Go to the beginning of the file
                    json.dump(data, f, indent=4)  # Write the updated list
                    f.truncate()  # Remove any remaining old data (important!)
                else:
                    print(f"Warning: JSON file '{filename}' does not contain a list.  Overwriting with new list.")
                    json.dump([new_result], f, indent=4)  # Overwrite if it's not a list

            except json.JSONDecodeError:  # Handle potential JSON errors
                print(f"Warning: JSON file '{filename}' is corrupted. Overwriting with new list.")
                json.dump([new_result], f, indent=4)  # Overwrite if JSON is invalid


# Conclusion:

**Random Forest Classification:** Random Forest Classification is an ensemble method that builds multiple decision trees and combines their predictions through majority voting, resulting in improved accuracy and reduced overfitting. It works well with high-dimensional data and is robust against noise, making it ideal for classification tasks.

**Named Entity Recognition (NER):** NER is a natural language processing technique used to identify and classify entities such as names, organizations, locations, and dates in text. It is particularly useful for extracting structured information from unstructured text and is widely used in information extraction and document analysis.

**Long Short-Term Memory (LSTM):** LSTM is a type of recurrent neural network designed to capture long-term dependencies in sequential data, such as time series or text. By using memory cells and gates, LSTMs can retain important information over long sequences, making them effective for tasks like speech recognition, machine translation, and sentiment analysis.

# Report on Challenges faced:-

**Three Tables Not Correlated:** 
* When the tables aren’t directly related, merging them can be challenging. You may need to identify meaningful relationships or common features, and apply techniques like data merging, joining, or feature engineering to create a unified dataset for training.

**Extra Columns:** 
* Extra columns introduce noise, which can reduce model accuracy. Feature selection techniques help eliminate irrelevant or redundant columns, allowing the model to focus on the most important features that drive predictions.

**Predicting Two Output Variables:** 
* Predicting multiple variables requires handling multi-output prediction. You can either treat them as separate tasks or use models that can predict both outputs at once, such as multi-output regression or classification models.

**Output in JSON Format:** 
* Converting predictions into JSON format is useful for APIs or web applications. After generating predictions, Python’s json library can be used to structure and output the results in a JSON format for easier integration.

**Small Data:** 
* Small datasets limit model performance, leading to lower accuracy. Techniques like data augmentation or synthetic data generation can help enlarge the dataset, improving model generalization and reducing overfitting.

**Low Accuracy:** 
* Low accuracy often stems from poor feature selection or model limitations. Enhancing data quality, experimenting with different algorithms, or applying hyperparameter tuning can help boost model performance and accuracy.

In [463]:
import json
import numpy as np

# Save the arrays as JSON files
with open("label_enc_name_classes.json", "w") as f:
    json.dump(label_enc_name.classes_.tolist(), f)

with open("label_enc_acc_classes.json", "w") as f:
    json.dump(label_enc_acc.classes_.tolist(), f)


In [465]:

# Load the classes from JSON files and convert back to numpy arrays
with open("label_enc_name_classes.json", "r") as f:
    label_enc_name.classes_ = np.array(json.load(f))

with open("label_enc_acc_classes.json", "r") as f:
    label_enc_acc.classes_ = np.array(json.load(f))

In [469]:
# Convert to a Python list for easier readability
print(list(label_enc_name.classes_))
print(list(label_enc_acc.classes_))


['ATP Media', 'AirBnB', 'Amazon', 'Apple', 'BANNERMAN ENERGY LTD', 'BP', 'Bing Lee', 'Bonbons Bakery', 'Coles', 'Comcast', 'Costis Fresh Seafood', 'David Jones', 'Eat Street Parramatta', 'Farha Roofing', 'Football Australia Darlinghurst', 'Haaretz', 'Heinemann', 'Kmart', 'LIDS', 'Macquarie Centre', 'Monarch Money', 'My Health Food Shop', 'Myer', 'Myer-AUD', 'Nextera Energy', 'No Found', 'Parknpay', 'Petbarn', 'QuickBooks Payments', 'Roseville Bakehouse', 'Rusty Penny Brewing', 'SOUNDHOUND AI INC', 'Solanna Pty Ltd', 'Superprof', 'Target', 'Tina Maids Colorado Springs', 'Tonkotsuya', 'Trybooking', 'Uniqlo', 'Vanguard', 'White Horse Coffee', 'Woolworths Online']
['AAdvantage Aviator', 'ANZ_Land Loan Offset_8143', 'Accounts Payable', 'Appliances', 'Cleaning and Maintenance', 'Door Lock & Keys', 'Door Lock, Key & Materials', 'Dues and Subscriptions', 'Household:Fun/Meals/Entertainment', 'Household:Groceries', 'Household:Health,Fitness,Personal', 'Meals and Entertainment', 'Meals and Entert

In [470]:
data = pd.read_csv("ABC.csv")
print(data.columns)


Index(['Unnamed: 0', 'Description', 'Matched_Name', 'Account 1'], dtype='object')


In [473]:
data.iloc[1,1]

'ANZ M-BANKING FUNDS TFER                TRANSFER 642683  TO  XXXXXXXX6546568'

<p style="text-align: center; font-size: 30px; font-weight: bold;">Thank You</p>